Add support for resuming W&B runs by ID #305

Open
josancamon19 wants to merge 9 commits into thinking-machines-lab:main from josancamon19:joan/fix-wandb-resume

Conversation

@josancamon19
Contributor

This ended up being a bit trickier than I expected, especially from a UX standpoint.

The current implementation can successfully resume a run, but it should only be used when you want to continue training longer with the same configuration.

If you pass a new config, it won’t update in W&B (and it really should be a new run). Also, if you resume from a checkpoint that isn’t the final checkpoint, the graphs can become messy and inconsistent.

This won’t happen by default, but giving users an easy failure mode like this makes me uneasy, and it makes me question whether we should expose it as an option at all.

Any thoughts from @joschu or @tyler-griggs would be really helpful.

…gurations

- Added `resume_wandb_run_id` parameter to `CLIConfig` and `Config` classes.
- Updated `WandbLogger` to handle resuming runs, including step offset detection.
- Modified logging setup to accommodate resuming existing W&B runs.
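
For illustration only, a config field like `resume_wandb_run_id` could be turned into arguments for `wandb.init`, which accepts `id` and `resume`. The `Config` dataclass and the `wandb_init_kwargs` helper below are hypothetical stand-ins, not the code from this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Config:
    # Hypothetical stand-in for the real Config class touched by this PR.
    wandb_project: Optional[str] = None
    resume_wandb_run_id: Optional[str] = None


def wandb_init_kwargs(cfg: Config) -> Dict[str, Any]:
    """Build keyword arguments for wandb.init from the config."""
    kwargs: Dict[str, Any] = {"project": cfg.wandb_project}
    if cfg.resume_wandb_run_id is not None:
        # resume="must" asks W&B to fail if the run ID does not exist,
        # rather than silently starting a new run.
        kwargs["id"] = cfg.resume_wandb_run_id
        kwargs["resume"] = "must"
    return kwargs
```

The result would be passed along as `wandb.init(**wandb_init_kwargs(cfg))`. When no run ID is set, the helper returns only the project, so the default behavior is unchanged.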
@josancamon19 marked this pull request as draft on January 18, 2026 02:22
@josancamon19
Contributor Author

This isn't finished yet. There's still some cleanup and a few edge cases to handle, but I’d love to hear your thoughts before I get to that.

@tyler-griggs
Contributor

Thanks for writing this up, @josancamon19. There's always the option of just using the env vars WANDB_RUN_ID and WANDB_RESUME. Wandb will pick these up automatically, so technically users can currently resume a wandb run by ID. This would simplify things quite a lot!

We could even add a note (e.g., in recipes/README.md under Resuming) like "To resume logging to an existing W&B run, set the following environment variables: WANDB_RUN_ID=<run_id> and WANDB_RESUME=must."
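
A minimal sketch of the environment-variable approach, assuming training is launched as a subprocess. `WANDB_RUN_ID` and `WANDB_RESUME` are the variables named above. The `resume_env` helper is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def resume_env(run_id: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the W&B resume variables set."""
    # Copy so the caller's environment (or os.environ) is not mutated.
    env = dict(os.environ if base is None else base)
    env["WANDB_RUN_ID"] = run_id
    env["WANDB_RESUME"] = "must"
    return env
```

A launcher could then call something like `subprocess.run([sys.executable, "train.py"], env=resume_env("<run_id>"))`, where `train.py` is a placeholder for the actual entry point.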

@josancamon19
Contributor Author

yep, makes a lot of sense!

@josancamon19
Contributor Author

Hey @tyler-griggs, I was putting together a minor PR with the docs changes, but after testing it I noticed that this approach doesn't work because the step counters get out of sync:

wandb: WARNING Tried to log to step 19 that is less than the current step 37. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.

I added minimal changes to fix this, wrapped inside WandbLogger, plus a README update.
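
One way to picture a step-offset fix for the warning above, as a sketch under assumptions and not the actual `WandbLogger` change: shift local steps so the first logged step lands on a step the resumed run still accepts. `OffsetLogger` and `start_step` are hypothetical names. With W&B, `log_fn` would be `wandb.log`, and `start_step` would have to come from the resumed run; how the PR detects it is not shown in this thread.

```python
from typing import Any, Callable, Dict, Optional


class OffsetLogger:
    """Shift local training steps so they continue after a resumed run."""

    def __init__(self, log_fn: Callable[..., None], start_step: int = 0) -> None:
        self._log_fn = log_fn
        # Assumed: the first step the resumed run will still accept.
        self._start_step = start_step
        self._offset: Optional[int] = None

    def log(self, metrics: Dict[str, Any], step: int) -> None:
        if self._offset is None:
            # Fix the offset on the first call; never shift steps backwards.
            self._offset = max(0, self._start_step - step)
        self._log_fn(metrics, step=step + self._offset)
```

In the scenario from the warning (local step 19, run already at step 37), constructing the logger with `start_step=38` would log local step 19 as step 38. Note that this shifts the x-axis, so curves resumed from a non-final checkpoint will still not line up with the original steps.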

@josancamon19 marked this pull request as ready for review on January 21, 2026 23:44
@josancamon19
Contributor Author

ping here, @tyler-griggs
