Incorrect shuffling of documents across epochs in GPTDataset

**Incorrect Dataset Shuffling**
- Currently, in `gpt_dataset.py`, the dataset is shuffled globally across all epochs rather than within each epoch, which is the standard approach.
- Both the shuffle index ([code](https://github.com/NVIDIA/Megatron-LM/blob/5f9c870f9f24b482509699d206a9dbb00958f6fc/megatron/core/datasets/gpt_dataset.py#L565)) and the document index ([code](https://github.com/NVIDIA/Megatron-LM/blob/5f9c870f9f24b482509699d206a9dbb00958f6fc/megatron/core/datasets/gpt_dataset.py#L532)) are shuffled across epochs, so a document can be seen several times before another document has been seen once.
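
To make the concern concrete, here is a toy illustration of what a global shuffle does to a multi-epoch document index. This is not the Megatron-LM code: `global_shuffle_doc_index` and its parameters are hypothetical names, and it assumes the index is built by repeating the document ids once per epoch and shuffling the whole array in one pass.

```python
import numpy as np

def global_shuffle_doc_index(num_docs, num_epochs, rng):
    # Hypothetical helper (assumption, not Megatron-LM's function):
    # repeat every document id once per epoch ...
    doc_idx = np.tile(np.arange(num_docs, dtype=np.int32), num_epochs)
    # ... then shuffle the whole multi-epoch array in a single pass.
    rng.shuffle(doc_idx)
    return doc_idx

rng = np.random.RandomState(1234)
doc_idx = global_shuffle_doc_index(num_docs=8, num_epochs=3, rng=rng)

# The first num_docs entries are what training consumes as "epoch 1".
# After a global shuffle they are generally not a permutation of 0..7,
# so some documents repeat while others are missing from this window.
first_window = doc_idx[:8]
print("first window:", first_window.tolist())
print("distinct docs in first window:", len(set(first_window.tolist())))
```

Every document still appears exactly `num_epochs` times in total; only the epoch boundaries are lost.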

**Question**
Has this been done on purpose? Is there any reason to prefer global shuffling over per-epoch shuffling?

**Solution**
Shuffle the data within each epoch instead of shuffling the full multi-epoch data at once. The implementation is straightforward. However, both the document index and the shuffle index need to be changed to fix the overall problem.
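
A minimal sketch of the per-epoch variant, under simplifying assumptions: every epoch contains the same number of samples and no sample spans an epoch boundary. `per_epoch_doc_index` and `per_epoch_shuffle_index` are hypothetical names for illustration, not functions from `gpt_dataset.py`.

```python
import numpy as np

def per_epoch_doc_index(num_docs, num_epochs, rng):
    # One independent permutation of the document ids per epoch,
    # concatenated in epoch order (hypothetical helper).
    epochs = [rng.permutation(num_docs).astype(np.int32) for _ in range(num_epochs)]
    return np.concatenate(epochs)

def per_epoch_shuffle_index(samples_per_epoch, num_epochs, rng):
    # Shuffle sample positions only inside each epoch's own block,
    # so a sample from epoch k never moves into another epoch.
    # Assumes an equal number of samples per epoch (simplification).
    blocks = []
    for epoch in range(num_epochs):
        start = epoch * samples_per_epoch
        blocks.append(start + rng.permutation(samples_per_epoch))
    return np.concatenate(blocks).astype(np.int64)

rng = np.random.RandomState(1234)
doc_idx = per_epoch_doc_index(num_docs=8, num_epochs=3, rng=rng)
shuffle_idx = per_epoch_shuffle_index(samples_per_epoch=4, num_epochs=3, rng=rng)
print(doc_idx.reshape(3, 8))      # each row is a permutation of 0..7
print(shuffle_idx.reshape(3, 4))  # row k only holds indices 4k..4k+3
```

With both indices built this way, every document is visited once before any document is visited a second time.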


Incorrect shuffling of documents across epochs in GPTDataset #697
