[Feature Request] Resume from Checkpoint

Resuming from a checkpoint is very important for recovering from crashes and for extending training, but RL pipelines have a lot of moving parts, so we need to decide what gets restored. At a minimum we need everything that is needed for regular training:

- Model checkpoint (weights)
- Optimizer state
- LR scheduler state (with the option to extend the schedule)
- Data step
- Seed (if you can use it)
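
To make the list above concrete, here is a minimal sketch of saving and restoring that minimum state. It uses only the Python standard library so it stays framework-agnostic. `TrainState`, `save_checkpoint`, `load_checkpoint`, and the `total_steps` key are hypothetical names for illustration, not an existing API of this project. The model, optimizer, and scheduler entries are treated as opaque dicts, assuming the real pipeline can export its state in that form.

```python
import os
import pickle
import random
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class TrainState:
    """Hypothetical container for the minimum resume state."""
    model_state: Dict[str, Any] = field(default_factory=dict)      # model weights
    optimizer_state: Dict[str, Any] = field(default_factory=dict)  # optimizer buffers
    scheduler_state: Dict[str, Any] = field(default_factory=dict)  # e.g. total_steps
    data_step: int = 0                                             # batches consumed so far
    seed: int = 0                                                  # seed the run started with
    rng_state: Optional[tuple] = None                              # exact RNG position


def save_checkpoint(state: TrainState, rng: random.Random, path: str) -> None:
    """Write the state atomically so a crash mid-save cannot corrupt `path`."""
    payload = asdict(state)
    payload["rng_state"] = rng.getstate()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(payload, f)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(
    path: str, extend_total_steps: Optional[int] = None
) -> Tuple[TrainState, random.Random]:
    """Load the state, rebuild the RNG, and optionally extend the LR schedule."""
    with open(path, "rb") as f:
        state = TrainState(**pickle.load(f))
    rng = random.Random(state.seed)
    if state.rng_state is not None:
        rng.setstate(state.rng_state)  # continue the random stream where it stopped
    if extend_total_steps is not None:
        # Assumption: the scheduler derives its LR from a stored `total_steps`.
        state.scheduler_state["total_steps"] = extend_total_steps
    return state, rng
```

Storing the RNG state next to the seed means a resumed run continues the same random stream instead of replaying it from the start. Whether that is enough for exact reproducibility depends on the rest of the pipeline.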

But on top of this, there are a lot of other things that could be restored in an RL pipeline:

- replay buffer data
- states of any tools or stateful services
- any additional models, such as critics or reward models, that get updated
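
One possible way to support these RL-specific pieces incrementally is to have each component expose its own save and restore hooks and to collect them under named keys. The sketch below assumes a `state_dict()` / `load_state_dict()` convention. `Stateful`, `ReplayBuffer`, `collect_state`, and `restore_state` are hypothetical names for illustration, not part of this project.

```python
from collections import deque
from typing import Any, Dict, List, Protocol


class Stateful(Protocol):
    """Anything that can export and re-import its own state."""
    def state_dict(self) -> Dict[str, Any]: ...
    def load_state_dict(self, state: Dict[str, Any]) -> None: ...


class ReplayBuffer:
    """Toy fixed-capacity buffer that drops the oldest item when full."""

    def __init__(self, capacity: int) -> None:
        self.items: deque = deque(maxlen=capacity)

    def add(self, item: Any) -> None:
        self.items.append(item)

    def state_dict(self) -> Dict[str, Any]:
        return {"capacity": self.items.maxlen, "items": list(self.items)}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.items = deque(state["items"], maxlen=state["capacity"])


def collect_state(components: Dict[str, Stateful]) -> Dict[str, Dict[str, Any]]:
    """Gather the state of every registered component under its name."""
    return {name: comp.state_dict() for name, comp in components.items()}


def restore_state(
    components: Dict[str, Stateful], saved: Dict[str, Dict[str, Any]]
) -> List[str]:
    """Restore components found in `saved`; return the names that were missing."""
    missing = []
    for name, comp in components.items():
        if name in saved:
            comp.load_state_dict(saved[name])
        else:
            missing.append(name)
    return missing
```

Returning the missing names lets the caller decide whether to fail or to continue with a fresh component, for example an empty replay buffer when an older checkpoint did not include one.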

I think we can start with just the first list.

[Feature Request] Resume from Checkpoint #362
