Add data checkpoint within epoch feature #17

@floatingsun

Description

When the dataset is very large and we want the model to walk through it without replacement, training may run for only one or a few epochs.
We can't finish the training in one shot because each job hits a wall-clock time limit. We need to add support for the dataloader to resume from a given iteration within an epoch.

Solution
open_clip provides a solution that slices all shards into many subsets, and each "sub-epoch" walks through one subset. We can record our sub-epoch number and use it at training start to restore the data checkpoint.
mlfoundations/open_clip#535
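A minimal sketch of the idea, assuming a list of WebDataset-style shard paths. All names here (`get_sub_epoch_shards`, `save_data_checkpoint`, the JSON checkpoint format) are illustrative, not open_clip's actual API:

```python
import json
import os

def get_sub_epoch_shards(shards, num_sub_epochs, sub_epoch):
    """Return the slice of shards belonging to one sub-epoch.

    Strided slicing keeps every sub-epoch roughly the same size and
    covers each shard exactly once per full epoch.
    """
    return shards[sub_epoch::num_sub_epochs]

def save_data_checkpoint(path, sub_epoch):
    # Persist the sub-epoch counter so the next job can resume.
    with open(path, "w") as f:
        json.dump({"sub_epoch": sub_epoch}, f)

def load_data_checkpoint(path):
    # Start from sub-epoch 0 if no checkpoint exists yet.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["sub_epoch"]
    return 0
```

Each time-limited job would then load the checkpoint, build its dataloader from only that sub-epoch's shards, and save `sub_epoch + 1` before exiting, so a full pass over the dataset is completed across several jobs without replacement.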

Metadata

Assignees

Labels

enhancement (New feature or request)
