[Feature Request] Resume from Checkpoint #362

@pbontrager

Resuming from a checkpoint is very important for recovering from crashes and for extending training, but RL pipelines have a lot of moving parts, and we need to decide which of them to restore. At a minimum, we need everything that is needed for regular training:

  • Model checkpoint
  • Optimizer state
  • LR schedulers (with the option to extend the schedule)
  • Data step (how far into the data the run had progressed)
  • Seed (if it can be used)
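
As a rough illustration of the list above, here is a minimal sketch using only the Python standard library. Everything in it is hypothetical and not taken from this project: `TrainState`, `lr_at`, `batch_rng`, `save_state`, and `load_state` are made-up names, plain dicts stand in for real model and optimizer state, and the linear-decay schedule is only there to show what "extending" could mean on resume.

```python
import json
import os
import random
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class TrainState:
    """Hypothetical container for the minimum state needed to resume."""
    model: dict = field(default_factory=dict)      # stand-in for model weights
    optimizer: dict = field(default_factory=dict)  # stand-in for optimizer state
    lr_total_steps: int = 100                      # schedule horizon, extendable on resume
    data_step: int = 0                             # number of batches already consumed
    seed: int = 0


def lr_at(step, total_steps, base_lr=1e-3):
    """Linear decay to zero over total_steps (illustrative schedule only)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


def batch_rng(seed, data_step):
    """Derive a per-step RNG so a resumed run draws the same batch for a given step."""
    return random.Random(seed * 1_000_003 + data_step)


def save_state(state, path):
    """Write to a temp file, then rename, so a crash mid-save leaves the old file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(state), f)
    os.replace(tmp_path, path)


def load_state(path, extend_to=None):
    """Load saved state; optionally lengthen the LR schedule to extend training."""
    with open(path) as f:
        state = TrainState(**json.load(f))
    if extend_to is not None:
        state.lr_total_steps = max(state.lr_total_steps, extend_to)
    return state
```

Calling `load_state(path, extend_to=200)` on a run saved with a 100-step schedule resumes at the saved `data_step` but evaluates the learning rate against the longer horizon. Whether an extended schedule should continue from the current value or be recomputed like this is one of the decisions this issue would need to settle.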

On top of this, there are a lot of other things that could be restored in an RL pipeline:

  • Replay buffer data
  • States of any tools or stateful services
  • Any additional models, such as critics or reward models, that get updated
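
To make the replay buffer item concrete, the sketch below shows one way a buffer could expose its state for checkpointing. The `ReplayBuffer` class and its `state_dict` / `load_state_dict` methods are hypothetical and only borrow a familiar naming pattern; they are not this project's API. The assumption here is that restoring a buffer means restoring both its contents and the sampler's RNG state, so that sampling continues as it would have without the interruption.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical FIFO replay buffer whose state can be saved and restored."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first
        self.rng = random.Random(seed)

    def add(self, episode):
        self.items.append(episode)

    def sample(self, k):
        return self.rng.sample(list(self.items), k)

    def state_dict(self):
        # Capture the contents and the sampler's RNG state together.
        return {
            "capacity": self.capacity,
            "items": list(self.items),
            "rng": self.rng.getstate(),
        }

    def load_state_dict(self, state):
        self.capacity = state["capacity"]
        self.items = deque(state["items"], maxlen=self.capacity)
        self.rng.setstate(state["rng"])
```

A buffer restored from `state_dict()` then returns the same next sample as the original would have. Tools, stateful services, and extra models would each need their own equivalent of this save and load pair.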

I think we can start with just the first list.

Labels: enhancement
