Incorrect Dataset Shuffling
- Currently, in gpt_dataset.py, the dataset is shuffled globally across all epochs rather than within each epoch, which is the standard approach.
- Both the document index and the shuffle index are built by shuffling across epochs instead of within each epoch.
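
To make the concern concrete, here is a minimal sketch of a global shuffle. It is not the actual gpt_dataset.py code: the helper name global_shuffle_doc_idx and its parameters are hypothetical, and it assumes the document index is formed by tiling the document ids once per epoch and shuffling the whole array a single time.

```python
import numpy as np


def global_shuffle_doc_idx(num_docs, num_epochs, seed):
    """Hypothetical helper: repeat the document ids for every epoch,
    then shuffle the whole array once (a global shuffle)."""
    rng = np.random.RandomState(seed)
    doc_idx = np.tile(np.arange(num_docs, dtype=np.int32), num_epochs)
    rng.shuffle(doc_idx)  # in-place shuffle across epoch boundaries
    return doc_idx


doc_idx = global_shuffle_doc_idx(num_docs=5, num_epochs=3, seed=1234)

# Inspect the first num_docs entries, i.e. what would be "epoch 0".
# Nothing forces this slice to contain every document exactly once.
first_epoch = doc_idx[:5]
print(sorted(first_epoch.tolist()))
```

Over the full run every document still appears num_epochs times, but a given document can be seen several times before another is seen at all, so the epoch boundaries stop being meaningful.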
Question
Has this been done on purpose? Is there any reason to prefer global shuffling over per-epoch shuffling?
Solution
Shuffle the data within each epoch instead of shuffling the full multi-epoch data at once. The implementation is straightforward, but both the document index and the shuffle index need to change to fix the overall problem.
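
A rough sketch of what per-epoch shuffling could look like for both indices. The function names and signatures below are hypothetical and are not taken from gpt_dataset.py, and the sketch assumes every epoch contains the same number of samples, which may not hold exactly when sequences span an epoch boundary.

```python
import numpy as np


def build_doc_idx_per_epoch(num_docs, num_epochs, rng):
    """Hypothetical helper: one independent permutation of the
    document ids per epoch, concatenated in epoch order."""
    epochs = [rng.permutation(num_docs) for _ in range(num_epochs)]
    return np.concatenate(epochs).astype(np.int32)


def build_shuffle_idx_per_epoch(samples_per_epoch, num_epochs, rng):
    """Hypothetical helper: permute sample positions only inside each
    epoch's own range, so no sample crosses an epoch boundary."""
    blocks = []
    for epoch in range(num_epochs):
        offset = epoch * samples_per_epoch
        blocks.append(offset + rng.permutation(samples_per_epoch))
    return np.concatenate(blocks).astype(np.int64)


rng = np.random.RandomState(1234)
per_epoch_doc_idx = build_doc_idx_per_epoch(num_docs=5, num_epochs=3, rng=rng)
per_epoch_shuffle_idx = build_shuffle_idx_per_epoch(samples_per_epoch=4, num_epochs=3, rng=rng)
```

With this layout, every slice of num_docs entries in the document index is a full permutation of the documents, and every slice of samples_per_epoch entries in the shuffle index only references samples from its own epoch.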