Closed
Labels: discussion, help wanted, optimize, question
Bug Report
Description
When using StreamingDataset with shuffle=True, accessing individual samples (e.g., dataset[0]) returns the same sample every time, despite shuffle being enabled. This makes it impossible to verify shuffling is working or to access random samples directly from the dataset.
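A possible stopgap for inspecting random samples is to draw the indices explicitly instead of relying on `shuffle=True`. The sketch below is plain Python and assumes only that the dataset supports `len()` and integer indexing; `sample_random_items` is a hypothetical helper, not part of LitData.

```python
import random
from typing import Any, List, Sequence, Tuple


def sample_random_items(
    dataset: Sequence[Any], k: int, seed: int = 0
) -> List[Tuple[int, Any]]:
    """Return k (index, item) pairs drawn without replacement.

    Hypothetical helper: works on any object with __len__ and __getitem__.
    """
    rng = random.Random(seed)  # local RNG, so the draw is reproducible
    indices = rng.sample(range(len(dataset)), k)
    return [(i, dataset[i]) for i in indices]


if __name__ == "__main__":
    # A plain list stands in for the dataset here.
    toy_dataset = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
    for index, item in sample_random_items(toy_dataset, k=3, seed=42):
        print(index, item)
```

With a fixed seed the same indices are drawn on every run, which makes it easy to compare samples across sessions.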
Environment
- LitData version: 0.2.52
- Python version: 3.11.10
Minimal Reproduction Code
from pathlib import Path
import litdata as ld
from litdata.streaming.item_loader import ParquetLoader
INDEX_PATH = Path(".litdata_index")
CACHE_DIR = Path(".litdata_cache")
# Define the Hugging Face dataset URI
hf_dataset_uri = "hf://datasets/pietrolesci/finewebedu-20B/bpe32000minipile"
INDEX_PATH.mkdir(exist_ok=True, parents=True)
ld.index_parquet_dataset(hf_dataset_uri, cache_dir=INDEX_PATH)
dataset = ld.StreamingDataset(
    hf_dataset_uri,
    item_loader=ParquetLoader(),
    cache_dir=CACHE_DIR,
    index_path=INDEX_PATH,
    shuffle=True,
    drop_last=True,
    max_pre_download=5,
)
# This returns the same sample every time
print("Sample 0:", dataset[0])
print("Sample 0 again:", dataset[0])

For context, I ran into this issue while trying to use LitData to transform a Hugging Face dataset into a packed dataset for LLM pretraining. To simplify, I started from a tokenized dataset and wanted to:
- Shuffle the documents
- Pack them into equal-length sequences
- Shuffle the sequences
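To make the three steps above concrete, here is a minimal in-memory sketch in plain Python. It does not use LitData, and it assumes that documents are lists of token ids, that a separator token (`eos_id`, a placeholder value) is appended after each document, and that a trailing partial sequence is dropped. All function names are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def shuffle_documents(docs: Sequence[List[int]], seed: int) -> List[List[int]]:
    """Step 1: return a shuffled copy of the tokenized documents."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def pack_sequences(
    docs: Iterable[List[int]], seq_len: int, eos_id: int
) -> Iterator[List[int]]:
    """Step 2: concatenate documents and cut them into seq_len-sized pieces."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # assumed document separator
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Tokens left in the buffer (fewer than seq_len) are dropped.


def shuffle_pack_shuffle(
    docs: Sequence[List[int]], seq_len: int, eos_id: int, seed: int = 0
) -> List[List[int]]:
    """Steps 1 to 3: shuffle documents, pack them, then shuffle the sequences."""
    packed = list(pack_sequences(shuffle_documents(docs, seed), seq_len, eos_id))
    random.Random(seed + 1).shuffle(packed)  # Step 3
    return packed


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for sequence in shuffle_pack_shuffle(toy_docs, seq_len=4, eos_id=0):
        print(sequence)
```

This version holds everything in memory, so it only illustrates the intended result; a dataset of this size would need a streaming or chunked equivalent.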
Any tips about this would be really helpful. Thanks a lot in advance for your attention!