
How to shuffle & optimize Hugging Face datasets for LLM pre-training with StreamingDataset? #696

@pietrolesci

Bug Report

Description

When using StreamingDataset with shuffle=True, accessing an individual sample by index (e.g., dataset[0]) returns the same sample every time, even though shuffle is enabled. This makes it impossible to verify that shuffling is working, or to retrieve random samples directly from the dataset.

Environment

  • LitData version: 0.2.52
  • Python version: 3.11.10

Minimal Reproduction Code

from pathlib import Path
import litdata as ld
from litdata.streaming.item_loader import ParquetLoader

INDEX_PATH = Path(".litdata_index")
CACHE_DIR = Path(".litdata_cache")

# Define the Hugging Face dataset URI
hf_dataset_uri = "hf://datasets/pietrolesci/finewebedu-20B/bpe32000minipile"
INDEX_PATH.mkdir(exist_ok=True, parents=True)
ld.index_parquet_dataset(hf_dataset_uri, cache_dir=INDEX_PATH)

dataset = ld.StreamingDataset(
    hf_dataset_uri, 
    item_loader=ParquetLoader(), 
    cache_dir=CACHE_DIR, 
    index_path=INDEX_PATH,
    shuffle=True,
    drop_last=True,
    max_pre_download=5,
)

# This returns the same sample every time
print("Sample 0:", dataset[0])
print("Sample 0 again:", dataset[0])
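This deterministic indexing is consistent with how many streaming-dataset designs work: shuffling is typically applied by the iterator as a per-epoch permutation of indices, while direct `__getitem__` access reads the underlying stored order. A minimal, library-independent sketch of that pattern (not LitData's actual implementation; the class and seed handling here are made up for illustration):

```python
import random

class ShuffledView:
    """Illustrative sketch: shuffle is a per-epoch permutation applied
    during iteration, while direct indexing reads the stored order."""

    def __init__(self, data, seed=42):
        self.data = data
        self.seed = seed
        self.epoch = 0

    def __getitem__(self, idx):
        # Direct indexing bypasses the permutation -> same sample every time.
        return self.data[idx]

    def __iter__(self):
        # A fresh permutation per epoch, seeded for reproducibility.
        order = list(range(len(self.data)))
        random.Random(self.seed + self.epoch).shuffle(order)
        self.epoch += 1
        for i in order:
            yield self.data[i]

view = ShuffledView(list(range(10)))
print(view[0], view[0])     # identical both times, regardless of shuffle
print(list(view))           # epoch 0 permutation
print(list(view))           # epoch 1: usually a different permutation
```

If LitData follows this pattern, iterating the dataset (e.g., through a StreamingDataLoader) would be the way to observe shuffling, rather than repeated indexing.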

For context, I ran into this issue while trying to use LitData to transform a Hugging Face dataset into a packed dataset for LLM pre-training. To simplify, I started from a tokenized dataset and wanted to:

  • Shuffle the documents
  • Pack them into equal-length sequences
  • Shuffle the sequences
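The three steps above can be sketched independently of LitData, assuming the tokenized documents are plain lists of token ids (the function name, `seq_len` parameter, and remainder-dropping policy are all hypothetical choices for illustration):

```python
import random

def pack_documents(docs, seq_len, seed=0):
    """Illustrative sketch: shuffle documents, concatenate their tokens,
    chop into equal-length sequences, then shuffle the sequences.
    The trailing tokens that don't fill a full sequence are dropped."""
    rng = random.Random(seed)
    docs = docs[:]            # copy so the caller's list is not mutated
    rng.shuffle(docs)         # step 1: shuffle the documents
    stream = [tok for doc in docs for tok in doc]  # concatenate tokens
    # step 2: pack into equal-length sequences, dropping the remainder
    seqs = [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]
    rng.shuffle(seqs)         # step 3: shuffle the packed sequences
    return seqs

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
seqs = pack_documents(docs, seq_len=4)
print(seqs)  # 9 tokens -> two sequences of length 4; one token dropped
```

In a real pipeline the concatenation would be done lazily over a shuffled stream rather than materializing all tokens in memory, but the ordering of the three steps is the same.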

Any tips about this would be really helpful. Thanks a lot in advance for your attention!
