
Commit 2f06216

Update documentation on Streaming Parquet Datasets from Huggingface and other cloud providers (#523)
* update readme docs for parquet dataset
* remove print
1 parent 1c8ab3f commit 2f06216

File tree

2 files changed: +69 -26 lines changed

README.md

Lines changed: 69 additions & 25 deletions
@@ -264,15 +264,35 @@ https://github.com/user-attachments/assets/3ba9e2ef-bf6b-41fc-a578-e4b4113a0e72
 
 </details>
 
+**Prerequisites:**
+
+Install the required dependencies to stream Hugging Face datasets:
+```sh
+pip install "litdata[extra]" huggingface_hub
+
+# Optional: To speed up downloads on high-bandwidth networks
+pip install hf_transfer
+export HF_HUB_ENABLE_HF_TRANSFER=1
+```
+
+**Stream Hugging Face dataset:**
+
 ```python
 import litdata as ld
 
-hf_uri = "hf://datasets/leonardPKU/clevr_cogen_a_train/data"
+# Define the Hugging Face dataset URI
+hf_dataset_uri = "hf://datasets/leonardPKU/clevr_cogen_a_train/data"
 
-ds = ld.StreamingDataset(hf_uri)
+# Create a streaming dataset
+dataset = ld.StreamingDataset(hf_dataset_uri)
 
-for _ds in ds:
-    print(f"{_ds[1]}; {_ds[2]}")
+# Print the first sample
+print("Sample", dataset[0])
+
+# Stream the dataset using StreamingDataLoader
+dataloader = ld.StreamingDataLoader(dataset, batch_size=4)
+for sample in dataloader:
+    pass
 ```
 
 You don’t need to worry about indexing the dataset or any other setup. **LitData** will **handle all the necessary steps automatically** and `cache` the `index.json` file, so you won't have to index it again.
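As context for the caching note above: a cached index file is just a small JSON document on disk, written once and reused on later runs. The sketch below illustrates that cache-then-reuse pattern with a hypothetical schema; it is not LitData's actual `index.json` format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical index schema for illustration only (not LitData's real format):
# the index records which files make up the dataset so streaming can skip
# re-scanning the remote storage on every run.

def write_index(cache_dir: Path, files: list) -> Path:
    index_path = cache_dir / "index.json"
    index_path.write_text(json.dumps({"files": files, "num_files": len(files)}))
    return index_path

def load_index(cache_dir: Path) -> dict:
    # A later run reads the cached index instead of re-indexing.
    return json.loads((cache_dir / "index.json").read_text())

cache = Path(tempfile.mkdtemp())
write_index(cache, ["data-00000.parquet", "data-00001.parquet"])
index = load_index(cache)
print("cached files:", index["num_files"])  # → cached files: 2
```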
@@ -288,12 +308,12 @@ If the Hugging Face dataset hasn't been indexed yet, you can index it first usin
 ```python
 import litdata as ld
 
-hf_uri = "hf://datasets/leonardPKU/clevr_cogen_a_train/data"
+hf_dataset_uri = "hf://datasets/leonardPKU/clevr_cogen_a_train/data"
 
-ld.index_hf_dataset(hf_uri)
+ld.index_hf_dataset(hf_dataset_uri)
 ```
 
-- Indexing the Hugging Face dataset ahead of time will make streaming faster, as it avoids the need for real-time indexing during streaming.
+- Indexing the Hugging Face dataset ahead of time will make streaming a bit faster, as it avoids the need for real-time indexing during streaming.
 
 - To use `HF gated dataset`, ensure the `HF_TOKEN` environment variable is set.
 
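For the gated-dataset note above, setting the token is a one-liner in the shell. The token value below is a placeholder for illustration, not a real credential:

```shell
# Placeholder value for illustration; use your own token from the
# Hugging Face account settings page.
export HF_TOKEN="hf_XXXXXXXXXXXXXXXX"

# Confirm the variable is visible to Python (and thus to huggingface_hub)
python3 -c 'import os; print("HF_TOKEN set:", bool(os.environ.get("HF_TOKEN")))'
```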
@@ -310,9 +330,9 @@ For full control over the cache path(`where index.json file will be stored`) and
 ```python
 import litdata as ld
 
-hf_uri = "hf://datasets/open-thoughts/OpenThoughts-114k/data"
+hf_dataset_uri = "hf://datasets/open-thoughts/OpenThoughts-114k/data"
 
-ld.index_parquet_dataset(hf_uri, "hf-index-dir")
+ld.index_parquet_dataset(hf_dataset_uri, "hf-index-dir")
 ```
 
 2. To stream HF datasets now, pass the `HF dataset URI`, the path where the `index.json` file is stored, and `ParquetLoader` as the `item_loader` to the **`StreamingDataset`**:
@@ -321,18 +341,18 @@ ld.index_parquet_dataset(hf_uri, "hf-index-dir")
 import litdata as ld
 from litdata.streaming.item_loader import ParquetLoader
 
-hf_uri = "hf://datasets/open-thoughts/OpenThoughts-114k/data"
+hf_dataset_uri = "hf://datasets/open-thoughts/OpenThoughts-114k/data"
 
-ds = ld.StreamingDataset(hf_uri, item_loader=ParquetLoader(), index_path="hf-index-dir")
+dataset = ld.StreamingDataset(hf_dataset_uri, item_loader=ParquetLoader(), index_path="hf-index-dir")
 
-for _ds in ds:
-    print(f"{_ds[0]}; {_ds[1]}\n")
+for batch in ld.StreamingDataLoader(dataset, batch_size=4):
+    pass
 ```
 
 &nbsp;
 
 ### LitData `Optimize` v/s `Parquet`
-
+<!-- TODO: Update benchmark -->
 Below is the benchmark for the `Imagenet dataset (155 GB)`, demonstrating that **`optimizing the dataset using LitData is faster and results in smaller output size compared to raw Parquet files`**.
 
 | **Operation** | **Size (GB)** | **Time (seconds)** | **Throughput (images/sec)** |
@@ -771,35 +791,59 @@ The `overwrite` mode will delete the existing data and start from fresh.
 <summary> ✅ Stream parquet datasets</summary>
 &nbsp;
 
-You can stream Parquet datasets directly without the need to convert them into the LitData optimized binary format.
+Stream Parquet datasets directly with LitData—no need to convert them into LitData’s optimized binary format! If your dataset is already in Parquet format, you can efficiently index and stream it using `StreamingDataset` and `StreamingDataLoader`.
 
-If your dataset is already in Parquet format, you can index and use it with StreamingDataset and DataLoader for efficient streaming.
+**Assumption:**
 
-Assumption:
 Your dataset directory contains one or more Parquet files.
 
-- **Index Parquet dataset**:
+**Prerequisites:**
+
+Install the required dependencies to stream Parquet datasets from cloud storage like **Amazon S3** or **Google Cloud Storage**:
+
+```bash
+# For Amazon S3
+pip install "litdata[extra]" s3fs
+
+# For Google Cloud Storage
+pip install "litdata[extra]" gcsfs
+```
+
+**Index Your Dataset**:
+
+Index your Parquet dataset to create an index file that LitData can use to stream the dataset.
 
 ```python
 import litdata as ld
 
-pq_data_uri = "gs://deep-litdata-parquet/my-parquet-data"
+# Point to your data stored in the cloud
+pq_dataset_uri = "s3://my-bucket/my-parquet-data"  # or "gs://my-bucket/my-parquet-data"
 
-ld.index_parquet_dataset(pq_data_uri)
+ld.index_parquet_dataset(pq_dataset_uri)
 ```
 
-- **Stream the dataset with `StreamingDataset` and `ParquetLoader`**
+**Stream the Dataset**
+
+Use `StreamingDataset` with `ParquetLoader` to load and stream the dataset efficiently:
 
-When using a Streaming Dataset, ensure you use `ParquetLoader`:
 
 ```python
 import litdata as ld
 from litdata.streaming.item_loader import ParquetLoader
 
-ds = ld.StreamingDataset('gs://deep-litdata-parquet/my-parquet-data', item_loader = ParquetLoader())
+# Specify your dataset location in the cloud
+pq_dataset_uri = "s3://my-bucket/my-parquet-data"  # or "gs://my-bucket/my-parquet-data"
+
+# Set up the streaming dataset
+dataset = ld.StreamingDataset(pq_dataset_uri, item_loader=ParquetLoader())
 
-for _ds in ds:
-    print(f"{_ds=}")
+# Print the first sample
+print("Sample", dataset[0])
+
+# Stream the dataset using StreamingDataLoader
+dataloader = ld.StreamingDataLoader(dataset, batch_size=4)
+for sample in dataloader:
+    pass
 ```
 
 </details>
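As a side note on the `batch_size=4` loops that recur throughout this diff: a loader with that setting yields consecutive groups of four samples, with a final short batch when the dataset size is not a multiple of four. The stdlib sketch below illustrates that batching pattern; it is not LitData's `StreamingDataLoader` implementation.

```python
from itertools import islice

def batched(iterable, batch_size):
    # Yield consecutive batches of up to batch_size items,
    # including a final short batch if items remain.
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

samples = list(range(10))
print(list(batched(samples, 4)))  # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```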

tests/streaming/test_parquet.py

Lines changed: 0 additions & 1 deletion
@@ -188,7 +188,6 @@ def test_stream_hf_parquet_dataset(monkeypatch, huggingface_hub_fs_mock, pq_data
     assert _ds["height"] == pq_data["height"][idx]
 
     # Test case 3: Streaming with passing item_loader
-    print("pre_load_chunk", pre_load_chunk, "low_memory", low_memory)
     ds = StreamingDataset(hf_url, item_loader=ParquetLoader(pre_load_chunk, low_memory))
     assert len(ds) == 25
     for i, _ds in enumerate(ds):

0 commit comments
