Add support for resuming W&B runs by ID #305
josancamon19 wants to merge 9 commits into thinking-machines-lab:main
…gurations
- Added `resume_wandb_run_id` parameter to `CLIConfig` and `Config` classes.
- Updated `WandbLogger` to handle resuming runs, including step offset detection.
- Modified logging setup to accommodate resuming existing W&B runs.
Not finished yet: there’s still some cleanup and a few edge cases to handle, but I’d love to hear your thoughts before I get to that.
Thanks for writing this up, @josancamon19. There's always the option of just using the env vars … We could even add a note (e.g., in recipes/README.md under …
yep, makes a lot of sense!
Hey @tyler-griggs, I was adding a minor PR with docs changes, but after testing this I noticed that it doesn't work because the logged steps get out of sync. I added minimal changes, wrapped inside `WandbLogger`, to fix this, plus a README update.
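To make the step mismatch concrete, here is a minimal sketch of the general idea of a step offset. It is not the code in this PR: `OffsetStepLogger`, `detect_step_offset`, and `log_fn` are hypothetical names, and the sketch assumes the last step already recorded in the resumed run is known. In practice `log_fn` could forward to `wandb.log(metrics, step=step)`.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def detect_step_offset(last_logged_step: Optional[int]) -> int:
    """Return the offset for a resumed run (hypothetical helper).

    A fresh run (no previous step) starts at 0; a resumed run continues
    one past the last step that was already logged.
    """
    return 0 if last_logged_step is None else last_logged_step + 1


class OffsetStepLogger:
    """Hypothetical wrapper that shifts local step numbers by a fixed offset."""

    def __init__(
        self,
        log_fn: Callable[[Dict[str, Any], int], None],
        step_offset: int = 0,
    ) -> None:
        self._log_fn = log_fn
        self._step_offset = step_offset

    def log(self, metrics: Dict[str, Any], step: int) -> None:
        # The training loop restarts counting at 0, so shift every step
        # to land after the steps the resumed run already contains.
        self._log_fn(metrics, step + self._step_offset)


# Usage: record calls in a list instead of sending them to W&B.
records: List[Tuple[int, Dict[str, Any]]] = []
logger = OffsetStepLogger(
    lambda metrics, step: records.append((step, metrics)),
    step_offset=detect_step_offset(last_logged_step=99),
)
logger.log({"loss": 0.5}, step=0)
logger.log({"loss": 0.4}, step=1)
```

Without the offset, the two calls above would be sent as steps 0 and 1, which the resumed run has already passed.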
ping here, @tyler-griggs
This ended up being a bit trickier than I expected, especially from a UX standpoint.
The current implementation can successfully resume a run, but it should only be used when you want to continue training longer with the same configuration.
If you pass a new config, it won’t update in W&B (and it really should be a new run). Also, if you resume from a checkpoint that isn’t the final checkpoint, the graphs can become messy and inconsistent.
This won’t happen by default, but giving users an easy failure mode like this makes me uneasy, and it makes me question whether we should expose it as an option at all.
Any thoughts from @joschu or @tyler-griggs would be really helpful.
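As a rough illustration of the non-final-checkpoint case, the sketch below computes which steps would be logged a second time. `overlapping_steps`, `last_logged_step`, and `resume_step` are hypothetical names for illustration only, not part of this PR or of the W&B API.

```python
def overlapping_steps(last_logged_step: int, resume_step: int) -> range:
    """Return the steps that would be logged again after resuming.

    last_logged_step: last step already present in the W&B run.
    resume_step: first step the training loop runs after loading the checkpoint.
    """
    return range(resume_step, last_logged_step + 1)


# Resuming from the final checkpoint: nothing overlaps.
final = overlapping_steps(last_logged_step=99, resume_step=100)

# Resuming from an earlier checkpoint: steps 60..99 are logged a second time.
earlier = overlapping_steps(last_logged_step=99, resume_step=60)
```

A check like this could be used to warn or refuse to resume when the range is non-empty, though whether that is the right UX is exactly the open question here.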