You don’t need to worry about indexing the dataset or any other setup. **LitData** will **handle all the necessary steps automatically** and `cache` the `index.json` file, so you won't have to index it again.
@@ -288,12 +308,12 @@ If the Hugging Face dataset hasn't been indexed yet, you can index it first usin
2. To stream HF datasets now, pass the `HF dataset URI`, the path where the `index.json` file is stored, and `ParquetLoader` as the `item_loader` to the **`StreamingDataset`**:
Below is the benchmark for the `Imagenet dataset (155 GB)`, demonstrating that **`optimizing the dataset using LitData is faster and results in smaller output size compared to raw Parquet files`**.
@@ -771,35 +791,59 @@ The `overwrite` mode will delete the existing data and start from fresh.
<summary> ✅ Stream parquet datasets</summary>


-You can stream Parquet datasets directly without the need to convert them into the LitData optimized binary format.
+Stream Parquet datasets directly with LitData—no need to convert them into LitData’s optimized binary format! If your dataset is already in Parquet format, you can efficiently index and stream it using `StreamingDataset` and `StreamingDataLoader`.

-If your dataset is already in Parquet format, you can index and use it with StreamingDataset and DataLoader for efficient streaming.
+**Assumption:**

-Assumption:
Your dataset directory contains one or more Parquet files.

--**Index Parquet dataset**:
+
+**Prerequisites:**
+
+Install the required dependencies to stream Parquet datasets from cloud storage like **Amazon S3** or **Google Cloud Storage**:
+
+```bash
+# For Amazon S3
+pip install "litdata[extra]" s3fs
+
+# For Google Cloud Storage
+pip install "litdata[extra]" gcsfs
+```
+
+**Index Your Dataset**:
+
+Index your Parquet dataset to create an index file that LitData can use to stream the dataset.
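
The prerequisites and the directory assumption in this hunk can be checked before indexing. The following is a minimal sketch, not code from this change: `backend_for`, `local_parquet_files`, and `index_dataset` are hypothetical helper names, the scheme-to-package mapping simply mirrors the two `pip install` lines, and the call to `index_parquet_dataset` is an assumption about the LitData API, since the hunk is cut off before it shows the indexing call.

```python
from __future__ import annotations

from importlib.util import find_spec
from pathlib import Path
from urllib.parse import urlparse

# Filesystem packages named in the prerequisites, keyed by URI scheme.
BACKENDS = {"s3": "s3fs", "gs": "gcsfs"}


def backend_for(dataset_uri: str) -> str | None:
    """Return the extra package a cloud URI needs, or None for other paths."""
    return BACKENDS.get(urlparse(dataset_uri).scheme)


def local_parquet_files(dataset_dir: str) -> list[Path]:
    """List the Parquet files directly inside a local dataset directory."""
    return sorted(Path(dataset_dir).glob("*.parquet"))


def index_dataset(dataset_uri: str) -> None:
    """Run the pre-flight checks, then index the dataset (hypothetical helper)."""
    package = backend_for(dataset_uri)
    if package is not None and find_spec(package) is None:
        raise RuntimeError(f'Missing dependency: pip install "litdata[extra]" {package}')
    # Only plain local paths (no URI scheme) are checked for Parquet files.
    if urlparse(dataset_uri).scheme == "" and not local_parquet_files(dataset_uri):
        raise FileNotFoundError(f"No Parquet files found in {dataset_uri}")

    import litdata as ld  # imported lazily so the checks above work without LitData

    # Assumption: this is the indexing call the section goes on to describe.
    ld.index_parquet_dataset(dataset_uri)
```

Calling `index_dataset("s3://my-bucket/my-parquet-data")` (a placeholder bucket) would fail early with an install hint if `s3fs` is absent, instead of failing partway through indexing.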