Proposed refactoring or deprecation
Instead of disabling shuffle / replacing RandomSampler with SequentialSampler in the train dataloader, replace the train dataset with a fixed subset of it using torch.utils.data.Subset (e.g. the first N samples of the dataset, where N is given by overfit_batches). This gives the same dataset samples as the previous implementation.
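A minimal sketch of the proposed approach, using a toy dataset and illustrative names (full_dataset, overfit_batches, batch_size are assumptions here, not Lightning internals):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy dataset standing in for the user's train dataset (illustrative only).
full_dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

overfit_batches = 4  # assumed here to be an int number of batches
batch_size = 8
num_samples = overfit_batches * batch_size  # first N samples to overfit on

# Proposed: restrict the dataset itself to its first N samples...
overfit_dataset = Subset(full_dataset, range(num_samples))

# ...and leave the user's shuffle setting untouched, so batches can still
# be re-ordered every epoch within those N samples.
train_loader = DataLoader(overfit_dataset, batch_size=batch_size, shuffle=True)
```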
Motivation
This prevents the training batches from being identical in every epoch.
Pitch
Added on 12 Oct 2021:
The current implementation of overfit_batches disables shuffling by replacing RandomSampler with SequentialSampler in the train dataloader, in order to restrict the training/overfitting to the first N samples of the train dataset for every epoch. However, this gives the same sequence of batches, and non-unique batches, across epochs, which is undesirable.
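Roughly (a simplified stand-in, not the actual Lightning internals), the current behaviour corresponds to a loader like this, which yields the same batch order every epoch:

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# Toy dataset standing in for a real train dataset.
dataset = TensorDataset(torch.arange(32).float().unsqueeze(1))

# Shuffling is disabled by forcing a SequentialSampler, so each epoch
# walks the data in the same fixed order; the trainer then consumes only
# the first N batches of this loader per epoch.
loader = DataLoader(dataset, batch_size=4, sampler=SequentialSampler(dataset))
```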
We should instead allow shuffling within the N samples across epochs, according to the shuffle option of the train dataloader, in order to give a different sequence of batches across epochs and mostly unique batches throughout the training process.
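A small sketch of the intended effect (toy data, illustrative names): the same N samples are kept, but their order, and hence the batch composition, changes from epoch to epoch when shuffle=True is honoured.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

dataset = TensorDataset(torch.arange(32).float().unsqueeze(1))
subset = Subset(dataset, range(8))  # overfit on the first N = 8 samples only

# shuffle=True is honoured, so the 8 samples are re-ordered each epoch.
loader = DataLoader(subset, batch_size=4, shuffle=True)

for epoch in range(3):
    batches = [batch[0].squeeze(1).tolist() for batch in loader]
    print(f"epoch {epoch}: {batches}")  # same samples, different batch order per epoch
```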