Feat: Update indexing of parquet dataset and add streaming support for Hugging Face datasets #505
Merged
tchaton merged 35 commits into Lightning-AI:main, Mar 20, 2025
Conversation
Collaborator: @bhimrazy, I wonder if we could benchmark pyarrow vs. polars for streaming the data.
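Such a comparison could start from a small timing harness like the sketch below. It is stdlib-only and uses placeholder callables; for the real benchmark, the lambdas would be swapped for a pyarrow `iter_batches` scan and a polars `scan_parquet` over the same file (names here are illustrative, not from this PR).

```python
import time
from typing import Callable, Dict


def benchmark(readers: Dict[str, Callable[[], object]], repeats: int = 3) -> Dict[str, float]:
    """Time each reader callable and return its best wall-clock seconds over `repeats` runs."""
    results: Dict[str, float] = {}
    for name, read in readers.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            read()  # e.g. exhaust all record batches of a parquet file
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


# Placeholder workloads; replace with real pyarrow / polars streaming reads.
timings = benchmark({
    "pyarrow": lambda: sum(range(100_000)),
    "polars": lambda: sum(range(100_000)),
})
print(timings)
```

Taking the best of several repeats reduces noise from filesystem caching and scheduler jitter, which matters when the two readers are close.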
tchaton reviewed, Mar 13, 2025
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
##          main   #505    +/- ##
==================================
  Coverage    79%    79%
==================================
  Files        39     39
  Lines      5859   5887    +28
==================================
+ Hits       4621   4648    +27
- Misses     1238   1239     +1
Collaborator (PR author):
Benchmarks for open-thoughts/OpenThoughts-114k
a) Results for this PR with low_memory=True:

⚡ main ~/litdata-benchmark SHUFFLE=0 PRELOAD=0 LOW_MEMORY=1 python stream_hf_dataset.py
Seed set to 42
Shuffle: False, Preload: False, Low Memory: True
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress: 0%| | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 48.63step/s]
Total number of samples in the dataset: 113957
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 218MB/s]
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 193MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 207MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 226MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 189MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 135MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:08<00:00, 51.68it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 8.631164073944092 or 13202.965949617279 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:07<00:00, 61.00it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 7.311708211898804 or 15585.54473275472 samples/sec.
Finished benchmarking.

b) Results for this PR with low_memory=False:

⚡ main ~/litdata-benchmark SHUFFLE=0 PRELOAD=0 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Benchmarking using litdata version: 0.2.42
Shuffle: False, Preload: False, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress: 0%| | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 23.32step/s]
Total number of samples in the dataset: 113957
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 234MB/s]
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 232MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 226MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 227MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 205MB/s]
train-00001-of-00006.parquet: 100%|████████████████████████████████████████████████████████████████████████████| 175M/175M [00:02<00:00, 72.0MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:09<00:00, 47.54it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 9.382962703704834 or 12145.093821165492 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:08<00:00, 50.02it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 8.917301416397095 or 12779.310479351509 samples/sec.
Finished benchmarking.

c) Results for this PR with low_memory=False and pre_load_chunk=True:

⚡ main ~/litdata-benchmark SHUFFLE=0 PRELOAD=1 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Shuffle: False, Preload: True, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress: 0%| | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 42.90step/s]
Total number of samples in the dataset: 113957
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 221MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 216MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 224MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 219MB/s]
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 203MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 233MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:09<00:00, 45.82it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 9.733589887619019 or 11707.597777877443 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 41.50it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 10.748220205307007 or 10602.402993807767 samples/sec.
Finished benchmarking.

d) Results for this PR with low_memory=False, pre_load_chunk=True and shuffle=True:

⚡ main ~/litdata-benchmark SHUFFLE=1 PRELOAD=1 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Shuffle: True, Preload: True, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress: 0%| | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 12.96step/s]
Total number of samples in the dataset: 113957
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 221MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 218MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 191MB/s]
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 217MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 192MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 214MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 44.21it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 10.087934494018555 or 11296.361705695894 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 41.31it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 10.79666018486023 or 10554.834578638292 samples/sec.
Finished benchmarking.

Benchmarks for HuggingFaceFW/fineweb-edu 10BT sample

a) Results for this PR with low_memory=True:

⚡ main ~/litdata-benchmark DATASET=1 SHUFFLE=0 PRELOAD=0 LOW_MEMORY=1 python stream_hf_dataset.py
Seed set to 42
Benchmarking using litdata version: 0.2.42
Shuffle: False, Preload: False, Low Memory: True
Dataset: hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT
Indexing HF dataset from hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/137494f2a7a4a992c32c1f0beab95c60c7933474ba57c5a17f304dc97347a732/index.json.
Indexing progress: 0%| | 0/14 [00:00<?, ?step/s]'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 5c7d2c3a-6572-46bf-9d9e-7177ec4138e3)')' thrown while requesting GET https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/resolve/main/sample/10BT/004_00000.parquet
Retrying in 1s [Retry 1/5].
Indexing progress: 36%|███████████████████████████████▊ | 5/14 [00:11<00:20, 2.24s/step]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/137494f2a7a4a992c32c1f0beab95c60c7933474ba57c5a17f304dc97347a732/index.json
Indexing progress: 100%|████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:11<00:00, 1.25step/s]
Total number of samples in the dataset: 9672101
013_00000.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████| 541M/541M [00:02<00:00, 228MB/s]
011_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 203MB/s]
002_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 202MB/s]
003_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 202MB/s]
004_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 201MB/s]
001_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 199MB/s]
000_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 193MB/s]
012_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 191MB/s]
009_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 186MB/s]
005_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 182MB/s]
008_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 179MB/s]
010_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 177MB/s]
006_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 176MB/s]
007_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 172MB/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 37782/37782 [04:25<00:00, 142.39it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 9672101 samples in 265.3391079902649 or 36451.84810564034 samples/sec.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 37782/37782 [03:38<00:00, 172.91it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 9672101 samples in 218.5037910938263 or 44265.1395731683 samples/sec.
tchaton approved these changes, Mar 20, 2025:

Simply put, fantastic work!
What does this PR do?
This PR introduces the following improvements:
Updated HFDownloader to use hf_hub_download, which can be combined with hf_transfer for faster downloads. Fixes #502.
Example usage:
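The example itself did not survive the page capture. As a hedged sketch, assuming the post-PR API where `StreamingDataset` accepts an `hf://` URI and indexes the parquet files on first use (the URI, batch size, and worker count below mirror the benchmark settings; `build_hf_loader` is a hypothetical helper, not part of litdata):

```python
import os


def build_hf_loader(uri: str, batch_size: int = 256, num_workers: int = 32):
    """Sketch: stream a Hugging Face parquet dataset with litdata.

    Assumes StreamingDataset accepts an hf:// URI after this PR, e.g.
    "hf://datasets/open-thoughts/OpenThoughts-114k/data".
    """
    # Optional: enable hf_transfer for faster downloads (pip install hf_transfer).
    os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")
    from litdata import StreamingDataset, StreamingDataLoader  # imported lazily
    dataset = StreamingDataset(uri)
    return StreamingDataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
```

Iterating the returned loader (`for batch in build_hf_loader("hf://datasets/...")`) would then stream samples as in the benchmarks above.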
Benchmarks
Using LitData
These benchmarks were generated using this script with the following settings:
batch_size = 256, num_workers = 32, and machine = A10G. The results may vary slightly across different runs.

Dataset: OpenThoughts-114k (3.55 GB)
Dataset: fineweb-edu (10BT Sample) (~26 GB)
Using Hugging Face datasets streaming
These benchmarks were generated using this script with the following settings:
batch_size = 256, num_workers = 32, and machine = A10G. The results may vary slightly across different runs.

Dataset: OpenThoughts-114k (3.55 GB)
Dataset: fineweb-edu (10BT Sample) (~26 GB)
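For reference, the Hugging Face `datasets` streaming baseline fetches samples lazily over HTTP instead of downloading parquet shards first. A minimal sketch, assuming the standard `load_dataset(..., streaming=True)` API (the dataset name and limit are illustrative):

```python
def iter_hf_streaming(name: str, split: str = "train", limit: int = 1000):
    """Sketch of the `datasets` streaming baseline: yields up to `limit` samples
    lazily, without materializing the dataset on disk."""
    from datasets import load_dataset  # imported lazily; pip install datasets
    ds = load_dataset(name, split=split, streaming=True)
    for i, sample in enumerate(ds):
        if i >= limit:
            break
        yield sample
```

Timing a full pass over such an iterator is what the streaming benchmark above measures.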
PR Review
Community members are welcome to review this PR once tests have passed.
If your PR was not previously discussed in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Yes, I did. 😊