Propose to fix wandb session not re-used when resume_from_checkpoint is used
#419
Closed
Saves the Weights & Biases run ID to the checkpoint file during training. When resuming from a checkpoint, this ID is loaded and used to initialize the W&B tracker, ensuring that logging continues in the same run. This prevents the creation of new, separate runs when a job is restarted.
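As a rough illustration of the idea (not the code from this PR), the sketch below stores a run ID next to the training state and reads it back on resume. The JSON checkpoint layout and the helper names `save_checkpoint`, `load_wandb_run_id`, and `build_wandb_init_kwargs` are assumptions made for this example; the only W&B behaviour relied on is that `wandb.init` accepts `id` and `resume` arguments.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def save_checkpoint(path: Path, step: int, wandb_run_id: Optional[str]) -> None:
    """Write a toy checkpoint that keeps the tracker's run ID with the training state."""
    state = {"step": step, "wandb_run_id": wandb_run_id}
    path.write_text(json.dumps(state))


def load_wandb_run_id(path: Path) -> Optional[str]:
    """Return the saved run ID, or None if there is no checkpoint or no ID in it."""
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("wandb_run_id")


def build_wandb_init_kwargs(project: str, resume_run_id: Optional[str]) -> Dict[str, Any]:
    """Build keyword arguments intended for wandb.init (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"project": project}
    if resume_run_id is not None:
        # Reuse the existing run instead of letting W&B create a new one.
        kwargs["id"] = resume_run_id
        kwargs["resume"] = "allow"
    return kwargs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "checkpoint.json"
        save_checkpoint(ckpt, step=100, wandb_run_id="abc123")
        print(build_wandb_init_kwargs("demo-project", load_wandb_run_id(ckpt)))
```

In a real job the resulting dictionary would be passed along as `wandb.init(**kwargs)`. Using `resume="allow"` rather than `"must"` is a design choice for this sketch: it lets the same code path work when the stored ID no longer exists on the server.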
wandb session not re-used when resume_from_checkpoint is used
Adds a comprehensive test suite to verify that wandb runs can be correctly resumed from a saved checkpoint. This prevents the creation of a new wandb run upon resumption, ensuring a continuous experiment history.

The tests cover the following scenarios:
- The core logic of resuming a run using a `resume_run_id`.
- Verification that both `PTDCheckpointer` and `AccelerateCheckpointer` save the `wandb_run_id`.
- The end-to-end resumption flow for `SFTTrainer` and `ControlTrainer`.
- Introspection checks to confirm trainers include the necessary logic to extract and use the run ID from a checkpoint.

Fixes huggingface#188
Adds comprehensive regression tests to reproduce the wandb run resumption failure reported in issue huggingface#188.

The new tests simulate a full training lifecycle:
1. Start a training run and log metrics with the `WandbTracker`.
2. Save a checkpoint partway through.
3. Stop the initial run.
4. Start a new session and load the checkpoint.
5. Initialize a new `WandbTracker` using the run ID from the checkpoint.

The tests assert that the resumed tracker uses the original wandb run ID, rather than creating a new run. Separate tests are included for both the `AccelerateCheckpointer` and `PTDCheckpointer` to ensure the bug is captured for both implementations.

Fixes huggingface#188
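The lifecycle described in this commit message can be sketched without network access by substituting a stand-in for the tracker. `FakeTracker` and `run_lifecycle` are hypothetical names used only for this illustration and are not the PR's actual test code; the `resume_run_id` parameter name follows the commit messages, while the in-memory checkpoint dictionary is an assumption.

```python
import uuid
from typing import Dict, List, Optional, Tuple


class FakeTracker:
    """Stand-in for a W&B tracker: creates a new run ID unless one is supplied."""

    def __init__(self, resume_run_id: Optional[str] = None) -> None:
        self.run_id = resume_run_id or uuid.uuid4().hex
        self.logged: List[Dict[str, float]] = []
        self.finished = False

    def log(self, metrics: Dict[str, float], step: int) -> None:
        self.logged.append({"step": float(step), **metrics})

    def finish(self) -> None:
        self.finished = True


def run_lifecycle() -> Tuple[str, str, str]:
    # 1. Start a training run and log metrics.
    first = FakeTracker()
    first.log({"loss": 1.0}, step=1)
    # 2. Save a checkpoint partway through (an in-memory dict here).
    checkpoint = {"step": 1, "wandb_run_id": first.run_id}
    # 3. Stop the initial run.
    first.finish()
    # 4-5. Start a new session, load the checkpoint, and resume with its run ID.
    resumed = FakeTracker(resume_run_id=checkpoint["wandb_run_id"])
    # For contrast: a tracker created without the stored ID starts a new run.
    fresh = FakeTracker()
    return first.run_id, resumed.run_id, fresh.run_id


if __name__ == "__main__":
    original, resumed, fresh = run_lifecycle()
    assert resumed == original
    assert fresh != original
```

The `fresh` tracker shows the failure mode from the issue: without the ID from the checkpoint, every restart gets a different run.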
…ndb resumption logic
Introduces a new integration test to verify that the WandB session is correctly resumed when training continues from a saved checkpoint. This ensures that experiment tracking data is consolidated into a single WandB run across multiple training sessions, rather than creating a new run upon each resumption.
…int argument type in SFTTrainerLoRAWandbResumeTests
This PR proposes to fix #188
Save the Weights & Biases run ID to the checkpoint file during training and load it when resuming from a checkpoint. This ensures logging continues in the same run, preventing the creation of new runs upon job restart.
I am new to this library, so this PR is open for any suggestions and simplifications.
@sayakpaul @a-r-r-o-w