
How to shuffle & optimize Hugging Face datasets for LLM pre-training with StreamingDataset? #696

@pietrolesci

Bug Report

Description

When using StreamingDataset with shuffle=True, accessing an individual sample by index (e.g., dataset[0]) returns the same sample every time, even though shuffle is enabled. This makes it impossible to verify that shuffling is working, or to retrieve random samples directly from the dataset.

Environment

  • LitData version: 0.2.52
  • Python version: 3.11.10

Minimal Reproduction Code

from pathlib import Path
import litdata as ld
from litdata.streaming.item_loader import ParquetLoader

INDEX_PATH = Path(".litdata_index")
CACHE_DIR = Path(".litdata_cache")

# Define the Hugging Face dataset URI
hf_dataset_uri = "hf://datasets/pietrolesci/finewebedu-20B/bpe32000minipile"
INDEX_PATH.mkdir(exist_ok=True, parents=True)
ld.index_parquet_dataset(hf_dataset_uri, cache_dir=INDEX_PATH)

dataset = ld.StreamingDataset(
    hf_dataset_uri, 
    item_loader=ParquetLoader(), 
    cache_dir=CACHE_DIR, 
    index_path=INDEX_PATH,
    shuffle=True,
    drop_last=True,
    max_pre_download=5,
)

# This returns the same sample every time
print("Sample 0:", dataset[0])
print("Sample 0 again:", dataset[0])
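This deterministic indexing is consistent with how many streaming-dataset designs work: shuffling is typically applied by the iterator as a per-epoch permutation of indices, while direct `__getitem__` access reads the underlying stored order. A minimal, library-independent sketch of that pattern (not LitData's actual implementation; the class and seed handling here are made up for illustration):

```python
import random

class ShuffledView:
    """Illustrative sketch: shuffle is a per-epoch permutation applied
    during iteration, while direct indexing reads the stored order."""

    def __init__(self, data, seed=42):
        self.data = data
        self.seed = seed
        self.epoch = 0

    def __getitem__(self, idx):
        # Direct indexing bypasses the permutation -> same sample every time.
        return self.data[idx]

    def __iter__(self):
        # A fresh permutation per epoch, seeded for reproducibility.
        order = list(range(len(self.data)))
        random.Random(self.seed + self.epoch).shuffle(order)
        self.epoch += 1
        for i in order:
            yield self.data[i]

view = ShuffledView(list(range(10)))
print(view[0], view[0])     # identical both times, regardless of shuffle
print(list(view))           # epoch 0 permutation
print(list(view))           # epoch 1: usually a different permutation
```

If LitData follows this pattern, iterating the dataset (e.g., through a StreamingDataLoader) would be the way to observe shuffling, rather than repeated indexing.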

For context, I ran into this issue while trying to use LitData to transform a Hugging Face dataset into a packed dataset for LLM pre-training. To simplify, I started from a tokenized dataset and wanted to:

  • Shuffle the documents
  • Pack them into equal-length sequences
  • Shuffle the sequences
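The three steps above can be sketched independently of LitData, assuming the tokenized documents are plain lists of token ids (the function name, `seq_len` parameter, and remainder-dropping policy are all hypothetical choices for illustration):

```python
import random

def pack_documents(docs, seq_len, seed=0):
    """Illustrative sketch: shuffle documents, concatenate their tokens,
    chop into equal-length sequences, then shuffle the sequences.
    The trailing tokens that don't fill a full sequence are dropped."""
    rng = random.Random(seed)
    docs = docs[:]            # copy so the caller's list is not mutated
    rng.shuffle(docs)         # step 1: shuffle the documents
    stream = [tok for doc in docs for tok in doc]  # concatenate tokens
    # step 2: pack into equal-length sequences, dropping the remainder
    seqs = [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]
    rng.shuffle(seqs)         # step 3: shuffle the packed sequences
    return seqs

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
seqs = pack_documents(docs, seq_len=4)
print(seqs)  # 9 tokens -> two sequences of length 4; one token dropped
```

In a real pipeline the concatenation would be done lazily over a shuffled stream rather than materializing all tokens in memory, but the ordering of the three steps is the same.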

Any tips about this would be really helpful. Thanks a lot in advance for your attention!
