Add support for resuming W&B runs by ID by josancamon19 · Pull Request #305 · thinking-machines-lab/tinker-cookbook

josancamon19 · 2026-01-18T01:50:26Z

This ended up being a bit trickier than I expected, especially from a UX standpoint.

The current implementation can successfully resume a run, but it should only be used when you want to continue training longer with the same configuration.

If you pass a new config, it won’t update in W&B (and it really should be a new run). Also, if you resume from a checkpoint that isn’t the final checkpoint, the graphs can become messy and inconsistent.

This won’t happen by default, but giving users an easy failure mode like this makes me uneasy, and it makes me question whether we should expose it as an option at all.

Any thoughts from @joschu or @tyler-griggs would be really helpful.

…gurations
- Added `resume_wandb_run_id` parameter to `CLIConfig` and `Config` classes.
- Updated `WandbLogger` to handle resuming runs, including step offset detection.
- Modified logging setup to accommodate resuming existing W&B runs.

josancamon19 · 2026-01-18T02:22:57Z

Not finished yet, there’s still some cleanup and a few edge cases to handle, but I’d love to hear your thoughts before I get to that.

tyler-griggs · 2026-01-19T06:05:45Z

Thanks for writing this up, @josancamon19. There's always the option of just using the env vars WANDB_RUN_ID and WANDB_RESUME. Wandb will pick these up automatically, so technically users can currently resume a wandb run by ID. This would simplify things quite a lot!

We could even add a note (e.g., in recipes/README.md under Resuming) like "To resume logging to an existing W&B run, set the following environment variables: WANDB_RUN_ID=<run_id> and WANDB_RESUME=must."
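As a rough illustration of the environment-variable route, the sketch below builds the two variables and applies them to the current process before any call to `wandb.init()`. The helper name `wandb_resume_env` and the run ID `abc123` are made up for this example; only `WANDB_RUN_ID` and `WANDB_RESUME` come from the suggestion above. The set of accepted modes is taken from W&B's documented `resume` values and may not be exhaustive.

```python
import os

# Resume modes documented for WANDB_RESUME (assumed complete for this sketch).
VALID_RESUME_MODES = {"allow", "must", "never", "auto"}


def wandb_resume_env(run_id: str, resume: str = "must") -> dict[str, str]:
    """Return the environment variables that ask W&B to resume `run_id`.

    Hypothetical helper for illustration; it is not part of tinker-cookbook.
    """
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    if resume not in VALID_RESUME_MODES:
        raise ValueError(f"unsupported WANDB_RESUME value: {resume!r}")
    return {"WANDB_RUN_ID": run_id, "WANDB_RESUME": resume}


# Set the variables before the training script calls wandb.init().
# "abc123" is a placeholder run ID, not a real run.
os.environ.update(wandb_resume_env("abc123"))
```

The same effect can be had by exporting the two variables in the shell before launching training; the helper only adds validation of the mode string.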

josancamon19 · 2026-01-20T07:05:13Z

yep, makes a lot of sense!

josancamon19 · 2026-01-21T23:40:59Z

Hey @tyler-griggs, I was putting together a small docs-only PR, but while testing this I noticed that it doesn't work because the steps get out of sync:

wandb: WARNING Tried to log to step 19 that is less than the current step 37. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.

I added minimal changes, wrapped inside `WandbLogger`, to fix this, plus a README update.
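One way to avoid the warning above is to shift local step numbers so they continue after the last step already recorded in the resumed run. The sketch below is not the code in this PR: `StepOffsetLogger`, `log_fn`, and `last_logged_step` are hypothetical names, and how the last step is read back from W&B is left out. It only shows the offset arithmetic.

```python
from typing import Callable, Optional

# A sink that receives (metrics, step); in practice this would forward to W&B.
LogFn = Callable[[dict, int], None]


class StepOffsetLogger:
    """Shift local training steps so they continue after a resumed run's last step.

    Hypothetical sketch; names and behavior are assumptions, not the PR's code.
    """

    def __init__(self, log_fn: LogFn, last_logged_step: Optional[int] = None):
        self._log_fn = log_fn
        self._last_logged_step = last_logged_step
        self._offset: Optional[int] = None  # decided on the first log() call

    def log(self, metrics: dict, step: int) -> int:
        if self._offset is None:
            if self._last_logged_step is None or step > self._last_logged_step:
                # Fresh run, or already ahead of the remote run: no shift needed.
                self._offset = 0
            else:
                # Make the first resumed step land right after the last logged one.
                self._offset = self._last_logged_step + 1 - step
        shifted = step + self._offset
        self._log_fn(metrics, shifted)
        return shifted


# Usage with the numbers from the warning: local step 19, remote run at step 37.
records: list = []
logger = StepOffsetLogger(lambda m, s: records.append((s, m)), last_logged_step=37)
logger.log({"loss": 0.5}, step=19)
logger.log({"loss": 0.4}, step=20)
```

With this shift the logged steps stay monotonic, but the x-axis no longer matches the true training step when resuming from a non-final checkpoint, which is the messy-graph concern raised earlier in this thread.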

josancamon19 · 2026-01-24T00:19:09Z

ping here, @tyler-griggs

josancamon19 mentioned this pull request Jan 18, 2026

suggested fixes and improvements to rl/train.py #281

Open

josancamon19 added 2 commits January 17, 2026 18:19

Fix pyright type error in WandB step detection

432f22f

removed pyproject.toml changes

c20c83f

josancamon19 marked this pull request as draft January 18, 2026 02:22

josancamon19 closed this Jan 20, 2026

josancamon19 reopened this Jan 21, 2026

josancamon19 added 4 commits January 21, 2026 15:01

simpler ml_logger

b9ee195

abstracted resume logic to wandb logger init

6fc1a6f

cleaned logic for _get_last_step_from_run

8b1bb7d

docstring removed

3af6713

josancamon19 added 2 commits January 21, 2026 15:42

Merge branch 'main' into joan/fix-wandb-resume

3ac0d1b

extra line fix pre-commit hook

a3c9e91

josancamon19 marked this pull request as ready for review January 21, 2026 23:44

josancamon19 wants to merge 9 commits into thinking-machines-lab:main from josancamon19:joan/fix-wandb-resume

2 participants
