Training stop/resume and checkpointing #173

@vvmnnnkv

Feature Description

With the Training API (#172) in place, we can add the ability to stop training and save intermediate training state so that training can be resumed later.

// Start the training
// Training object would contain current epoch, batch, modelParameters
training = Job.train(...)

Suggested API:

// Stop training
training.stop()

New event emitted by Job.train: 'stop'
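
To make the stop semantics concrete, here is a minimal sketch of how `stop()` and the 'stop' event could fit together. It is not the actual implementation: the `Training` class, its constructor options, and the `step()` method are hypothetical stand-ins for whatever `Job.train` returns, and Node's built-in `EventEmitter` is assumed for event delivery. The sketch also fires the event synchronously, whereas a real training loop would more likely honor the request at the next batch boundary.

```javascript
const { EventEmitter } = require("events");

// Hypothetical stand-in for the object returned by Job.train(...).
class Training extends EventEmitter {
  constructor({ epochs, batchesPerEpoch }) {
    super();
    this.epochs = epochs;
    this.batchesPerEpoch = batchesPerEpoch;
    this.epoch = 0; // epoch currently in progress
    this.batch = 0; // next batch to run within that epoch
    this.stopped = false;
  }

  // Stop training and report the position reached, exactly once.
  stop() {
    if (this.stopped) return;
    this.stopped = true;
    this.emit("stop", { epoch: this.epoch, batch: this.batch });
  }

  // Run one batch. Returns false once training is stopped or finished.
  step() {
    if (this.stopped || this.epoch >= this.epochs) return false;
    // ... a real implementation would execute the training plan here ...
    this.batch += 1;
    if (this.batch === this.batchesPerEpoch) {
      this.batch = 0;
      this.epoch += 1;
    }
    return true;
  }
}

const training = new Training({ epochs: 2, batchesPerEpoch: 3 });
const stopEvents = [];
training.on("stop", (position) => stopEvents.push(position));

training.step(); // runs batch 0 of epoch 0
training.step(); // runs batch 1 of epoch 0
training.stop(); // emits 'stop' with { epoch: 0, batch: 2 }
```

After `stop()`, the `epoch` and `batch` properties still describe where training left off, which is the information a checkpoint needs.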

// User-defined serialization (serialize/unserialize/storage is up to user)
serialized_checkpoint = serialize(training)
unserialized_checkpoint = unserialize(serialized_checkpoint)
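
Since serialization is left to the user, the following is only one possible approach, sketched under assumptions: `serialize` and `unserialize` are user-written helpers (not library functions), the checkpoint is stored as JSON, and model parameters are assumed to be a list of `Float32Array` values. The real parameter type may differ.

```javascript
// Hypothetical user-defined helpers; the proposal leaves format and storage to the user.
function serialize(training) {
  return JSON.stringify({
    epoch: training.epoch,
    batch: training.batch,
    // Typed arrays do not round-trip through JSON, so store plain arrays.
    modelParameters: training.modelParameters.map((p) => Array.from(p)),
  });
}

function unserialize(serialized) {
  const data = JSON.parse(serialized);
  return {
    epoch: data.epoch,
    batch: data.batch,
    // Rebuild the typed arrays (assumed parameter representation).
    modelParameters: data.modelParameters.map((p) => Float32Array.from(p)),
  };
}

// Stand-in for a stopped training object.
const training = {
  epoch: 3,
  batch: 7,
  modelParameters: [new Float32Array([0.5, -1.25]), new Float32Array([2])],
};

const serialized_checkpoint = serialize(training); // a string; store it anywhere
const unserialized_checkpoint = unserialize(serialized_checkpoint);
```

Keeping the checkpoint a plain string means it can go into localStorage, a file, or a remote store without the training API needing to know about it.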

// Supplying checkpoint back to Job.train
training = Job.train(trainingPlan, {
   ...
   checkpoint: unserialized_checkpoint
})

The training loop should read the checkpoint's properties
and restore the model params, epoch, step, batchSize, etc. from them.
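
A rough sketch of how a loop might pick up its position from a checkpoint is shown here. The `remainingSteps` helper and its option names are hypothetical, and it assumes `checkpoint.batch` holds the index of the next batch to run. Model parameters and `batchSize` would be restored from the checkpoint in the same way before the first resumed batch.

```javascript
// Hypothetical helper: yields the (epoch, batch) positions that still need to run.
function* remainingSteps({ epochs, batchesPerEpoch, checkpoint }) {
  // Without a checkpoint, start from the very beginning.
  const startEpoch = checkpoint ? checkpoint.epoch : 0;
  const startBatch = checkpoint ? checkpoint.batch : 0;
  for (let epoch = startEpoch; epoch < epochs; epoch++) {
    // Only the resumed epoch starts part-way through; later epochs start at batch 0.
    const firstBatch = epoch === startEpoch ? startBatch : 0;
    for (let batch = firstBatch; batch < batchesPerEpoch; batch++) {
      yield { epoch, batch };
    }
  }
}

const checkpoint = { epoch: 1, batch: 2 };
const resumed = [...remainingSteps({ epochs: 3, batchesPerEpoch: 3, checkpoint })];
// resumed covers (1,2), (2,0), (2,1), (2,2)

const fresh = [...remainingSteps({ epochs: 3, batchesPerEpoch: 3 })];
// fresh covers all 9 positions, starting at (0,0)
```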

What alternatives have you considered?

The API was discussed within the FL team.

Additional Context

See #172

Labels

Type: Improvement 📈 (Performance improvement not introducing a new feature or requiring a major refactor)
Type: New Feature ➕ (Introduction of a completely new addition to the codebase)
