Training stop/resume and checkpointing #173
Open
Labels
Type: Improvement (performance improvement not introducing a new feature or requiring a major refactor)
Type: New Feature (introduction of a completely new addition to the codebase)
Description
Feature Description
With the Training API (#172) in place, we can add the ability to stop training and save the intermediate training state so that training can be resumed later.
// Start the training
// Training object would contain current epoch, batch, modelParameters
training = Job.train(...)
Suggested API:
// Stop training
training.stop()
New events in Job.train: 'stop'
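To make the stop behaviour concrete, here is a minimal sketch of how a training object might honour `stop()` and emit a 'stop' event. Everything in it is an assumption for illustration: the `Training` class, its constructor options, and the `batchEnd` and `end` events are hypothetical and not part of the existing API, and the loop is synchronous and only advances counters, whereas a real training loop would be asynchronous.

```javascript
// Hypothetical sketch: class, options and event names other than 'stop'
// are assumptions, not the library's actual API.
const { EventEmitter } = require("events");

class Training extends EventEmitter {
  constructor({ epochs, batchesPerEpoch }) {
    super();
    this.epochs = epochs;
    this.batchesPerEpoch = batchesPerEpoch;
    this.epoch = 0;
    this.batch = 0;
    this.stopped = false;
  }

  // Request a stop; the loop checks this flag before each batch.
  stop() {
    this.stopped = true;
  }

  // Stand-in for the real training loop: it only advances counters.
  run() {
    while (this.epoch < this.epochs) {
      while (this.batch < this.batchesPerEpoch) {
        if (this.stopped) {
          // Report the position at which training was interrupted.
          this.emit("stop", { epoch: this.epoch, batch: this.batch });
          return;
        }
        this.batch += 1; // one (fake) training step
        this.emit("batchEnd", { epoch: this.epoch, batch: this.batch });
      }
      this.batch = 0;
      this.epoch += 1;
    }
    this.emit("end");
  }
}

// Usage: stop after the second batch of the first epoch.
const training = new Training({ epochs: 2, batchesPerEpoch: 3 });
training.on("batchEnd", ({ epoch, batch }) => {
  if (epoch === 0 && batch === 2) training.stop();
});
training.on("stop", (position) => console.log("stopped at", position));
training.run();
```

Checking the flag only between batches keeps the saved position consistent: the object never stops halfway through a parameter update.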
// User-defined serialization (serialize/unserialize/storage is up to user)
serialized_checkpoint = serialize(training)
unserialized_checkpoint = unserialize(serialized_checkpoint)
// Supplying checkpoint back to Job.train
training = Job.train(trainingPlan, {
...
checkpoint: unserialized_checkpoint
})
The training loop should read the properties of the checkpoint
and restore the model params, epoch, step, batchSize, etc. from it.
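The round trip described above (stop, serialize, unserialize, resume) can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: the checkpoint shape, the `toCheckpoint` helper, the JSON-based `serialize`/`unserialize`, the `stopAfterSteps` option (which simulates a call to `training.stop()`), and the stand-in `train` function with its fake "add 1" update are all hypothetical.

```javascript
// Hypothetical sketch: checkpoint shape and helper names are assumptions.

// Keep only the plain-data fields needed to resume.
function toCheckpoint(training) {
  return {
    epoch: training.epoch,
    step: training.step,
    batchSize: training.batchSize,
    // Typed arrays do not survive JSON as arrays, so convert them first.
    modelParams: training.modelParams.map((p) => Array.from(p)),
  };
}

// User-defined serialization; JSON is just one possible choice.
const serialize = (training) => JSON.stringify(toCheckpoint(training));
const unserialize = (text) => JSON.parse(text);

// Stand-in for Job.train: starts from scratch unless a checkpoint is given.
function train(plan, options) {
  const { epochs, stepsPerEpoch, checkpoint, stopAfterSteps = Infinity } = options;
  const source = checkpoint ? checkpoint.modelParams : plan.initialParams;
  const state = {
    epoch: checkpoint ? checkpoint.epoch : 0,
    step: checkpoint ? checkpoint.step : 0,
    batchSize: checkpoint ? checkpoint.batchSize : options.batchSize,
    modelParams: source.map((p) => Float32Array.from(p)), // copy, never mutate input
  };
  let stepsDone = 0;
  while (state.epoch < epochs) {
    while (state.step < stepsPerEpoch) {
      if (stepsDone >= stopAfterSteps) return state; // simulated training.stop()
      // Fake parameter update: add 1 to every value.
      state.modelParams.forEach((p) => {
        for (let i = 0; i < p.length; i++) p[i] += 1;
      });
      state.step += 1;
      stepsDone += 1;
    }
    state.step = 0;
    state.epoch += 1;
  }
  return state;
}

// Usage: an interrupted-then-resumed run should match an uninterrupted one.
const plan = { initialParams: [new Float32Array([0, 0])] };
const options = { epochs: 2, batchSize: 32, stepsPerEpoch: 3 };
const uninterrupted = train(plan, options);
const partial = train(plan, { ...options, stopAfterSteps: 4 });
const resumed = train(plan, { ...options, checkpoint: unserialize(serialize(partial)) });
console.log(Array.from(uninterrupted.modelParams[0]), Array.from(resumed.modelParams[0]));
```

Comparing a resumed run against an uninterrupted one is a useful check for any real implementation, since it shows whether the checkpoint captures everything the loop depends on.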
What alternatives have you considered?
The API was discussed within the FL team.
Additional Context
See #172