Incorrect shuffling of documents across epochs in GPTDataset #697

@argitrage

Description

Incorrect Dataset Shuffling

  • Currently, in gpt_dataset.py, the dataset is shuffled globally across all epochs rather than within each epoch, which is the standard approach.
  • Both the shuffle index and the document index are shuffled across epochs.
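To make the concern concrete, here is a minimal NumPy sketch of the behaviour described above. It is not the code from gpt_dataset.py: the function names `global_shuffle_doc_idx` and `missing_per_window`, the seed, and the toy sizes are all hypothetical. It tiles the document ids once per epoch, shuffles the whole array at once, and then counts how many documents are absent from each consecutive epoch-sized window.

```python
import numpy as np


def global_shuffle_doc_idx(num_docs, num_epochs, seed):
    """Hypothetical helper: repeat the document ids for every epoch,
    then shuffle the full multi-epoch array in a single pass."""
    rng = np.random.RandomState(seed)
    doc_idx = np.tile(np.arange(num_docs), num_epochs)
    rng.shuffle(doc_idx)
    return doc_idx


def missing_per_window(doc_idx, num_docs):
    """Count the documents that never appear in each consecutive
    window of num_docs entries (one window = one nominal epoch)."""
    windows = doc_idx.reshape(-1, num_docs)
    return [int(num_docs - len(np.unique(w))) for w in windows]


doc_idx = global_shuffle_doc_idx(num_docs=8, num_epochs=3, seed=1234)
# A non-zero entry means that window repeats some documents and skips others.
print(missing_per_window(doc_idx, 8))
```

Every document still appears exactly `num_epochs` times overall, but nothing forces each window to contain every document once, which is the difference from per-epoch shuffling.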

Question
Has this been done on purpose? Is there any reason to prefer global shuffling over per-epoch shuffling?

Solution
Shuffle the data within each epoch instead of shuffling the full multi-epoch data at once. The implementation is straightforward, but both the document index and the shuffle index need to be changed to fix the overall problem.
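A rough sketch of what per-epoch shuffling could look like for both indices is below. The function names and signatures are hypothetical and do not match the existing helpers in gpt_dataset.py. The sketch also assumes a fixed, integer number of samples per epoch, which may not hold when samples span document boundaries.

```python
import numpy as np


def build_doc_idx_per_epoch(num_docs, num_epochs, rng):
    """Hypothetical helper: draw one independent permutation of the
    document ids per epoch and concatenate them in epoch order."""
    epochs = [rng.permutation(num_docs).astype(np.int32)
              for _ in range(num_epochs)]
    return np.concatenate(epochs)


def build_shuffle_idx_per_epoch(samples_per_epoch, num_epochs, rng):
    """Hypothetical helper: permute sample ids inside each epoch's block.
    Assumes every epoch holds exactly samples_per_epoch samples."""
    blocks = [epoch * samples_per_epoch + rng.permutation(samples_per_epoch)
              for epoch in range(num_epochs)]
    return np.concatenate(blocks).astype(np.int64)


rng = np.random.RandomState(1234)
doc_idx = build_doc_idx_per_epoch(num_docs=8, num_epochs=3, rng=rng)
shuffle_idx = build_shuffle_idx_per_epoch(samples_per_epoch=5, num_epochs=3, rng=rng)
# Each row is one epoch: every document (or sample id in that epoch's range)
# appears exactly once per row.
print(doc_idx.reshape(3, 8))
print(shuffle_idx.reshape(3, 5))
```

With this layout, every epoch-sized slice of either index is a permutation, so no document or sample is seen twice before all others have been seen once.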

Labels: stale (no activity in 60 days on issue or PR)