Add first draft of resuming training from latest checkpoint #638
Merged
50 commits
e6523f4
Add first draft of resuming training from latest checkpoint
mina-parham 50c58af
Ruff and add resume_from_checkpoint endpoint
mina-parham 771be20
Merge branch 'main' into add/resume-training-checkpoint
mina-parham 976baea
Ruff
mina-parham 841c0a7
Move resume checkpoint endpoint to remote.py
mina-parham cdd5f55
Add resume from checkpoint endpoint
mina-parham 90677be
Ruff
mina-parham 0f29b56
Ruff
mina-parham d0c472a
Merge branch 'main' into add/resume-training-checkpoint
mina-parham 42273eb
Fix the bug related to passing the checkpoint
mina-parham 2f8844d
Use HTTPException with proper status codes (400/404/500) instead of …
mina-parham 9e4a615
use parent job data and proper error handling
mina-parham e54f02f
Merge branch 'main' into add/resume-training-checkpoint
mina-parham 0cfa5f8
Ruff
mina-parham 0909b8d
Merge branch 'add/resume-training-checkpoint' of https://github.com/t…
mina-parham eedada5
Debug
mina-parham cb59a9a
Debug
mina-parham 4f19401
Merge branch 'main' into add/resume-training-checkpoint
mina-parham d2dc3ad
Debug
mina-parham bc30c47
Merge branch 'add/resume-training-checkpoint' of https://github.com/t…
mina-parham 276a028
Clean up resume_from_checkpoint debugging code
mina-parham 139edb6
Fix checkpoint resume functionality for remote training jobs
mina-parham e59f0cb
Fix checkpoint resume: add --resume_from_checkpoint to python command
mina-parham e3813f1
Refactor resume_from_checkpoint to reuse launch logic
mina-parham b6c91f7
Merge branch 'main' into add/resume-training-checkpoint
mina-parham 3254e93
Ruff
mina-parham acc7215
Merge branch 'add/resume-training-checkpoint' of https://github.com/t…
mina-parham 272e311
Fix security vulnerability
mina-parham 4a5b703
Ruff
mina-parham 58a9d8d
Potential fix for code scanning alert no. 513: Uncontrolled data used…
mina-parham d97f6f7
Merge launch_remote and resume_from_checkpoint functions
mina-parham 120dbbf
Debug load checkpoint problem
mina-parham 5b6f982
Remove redundant print and ruff
mina-parham 5238ca9
Fix resume training validation for missing cluster_name
mina-parham 0fb8b50
Merge branch 'main' into add/resume-training-checkpoint
mina-parham df85fea
Pass checkpoint metadata via env vars
mina-parham c70bab4
Merge branch 'add/resume-training-checkpoint' of https://github.com/t…
mina-parham d545d82
Bump the version of sdk into 0.0.41
mina-parham b453102
Add logs for debugging
mina-parham 321403e
Restore stream_logs from main
mina-parham 8806ff8
Restore stream_logs from main
mina-parham e0246cf
Restore stream_logs from main
mina-parham b8d237d
Ruff
mina-parham bf7fed0
Remove prints
mina-parham a6d715b
Merge job_data and request_data
mina-parham 85310c0
stores parent_job_id and resumed_from_checkpoint in new job's data
mina-parham 51bda4f
Remove comments
mina-parham c100901
Merge branch 'main' into add/resume-training-checkpoint
mina-parham 6a3694a
Merge branch 'main' into add/resume-training-checkpoint
deep1401 15472e2
merge conflict
mina-parham
Just one question: what is the difference between using this route and launch/remote?
Maybe we could add the extra params there and reuse it, so we don't have duplicate routes? If there is extra logic in here, please let me know.
I've merged these two routes.
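For readers following the thread, here is a rough sketch of what a single merged launch path could look like, with resume handled through optional parameters instead of a second route. It is not the code from this PR: `Job`, `LaunchError`, `launch_remote` and its parameters, and the in-memory `jobs` dict are all hypothetical names chosen for illustration. Only the `--resume_from_checkpoint` flag, the 400/404 status codes, and the `parent_job_id` / `resumed_from_checkpoint` keys are taken from the commit messages above.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class LaunchError(Exception):
    """Hypothetical error that carries an HTTP-style status code."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


@dataclass
class Job:
    # Hypothetical job record; the real project's schema may differ.
    job_id: str
    command: str
    checkpoints: List[str] = field(default_factory=list)
    data: Dict[str, str] = field(default_factory=dict)


def launch_remote(
    jobs: Dict[str, Job],
    new_job_id: str,
    command: Optional[str] = None,
    parent_job_id: Optional[str] = None,
    checkpoint: Optional[str] = None,
) -> Job:
    """Launch a fresh job, or resume from a parent job's checkpoint."""
    resuming = parent_job_id is not None or checkpoint is not None

    if not resuming:
        # Plain launch: the caller must supply the command.
        if not command:
            raise LaunchError(400, "command is required for a new launch")
        job = Job(new_job_id, command)
    else:
        # Resume: both the parent job and the checkpoint must be given.
        if parent_job_id is None or checkpoint is None:
            raise LaunchError(400, "parent_job_id and checkpoint must be set together")
        parent = jobs.get(parent_job_id)
        if parent is None:
            raise LaunchError(404, f"job {parent_job_id} not found")
        # Only accept checkpoints already recorded for the parent, so the
        # user-supplied name is never used to build a path directly.
        if checkpoint not in parent.checkpoints:
            raise LaunchError(404, f"checkpoint {checkpoint} not found")
        # Reuse the parent's command and data, and record the lineage.
        resumed_command = f"{parent.command} --resume_from_checkpoint {shlex.quote(checkpoint)}"
        data = {
            **parent.data,
            "parent_job_id": parent_job_id,
            "resumed_from_checkpoint": checkpoint,
        }
        job = Job(new_job_id, resumed_command, data=data)

    jobs[new_job_id] = job
    return job


# Usage: one entry point covers both cases.
jobs = {
    "job-1": Job("job-1", "python train.py", ["checkpoint-200"], {"cluster_name": "c1"}),
}
resumed = launch_remote(jobs, "job-2", parent_job_id="job-1", checkpoint="checkpoint-200")
```

Keeping one entry point means validation and job creation live in a single place, and a resume is just a launch whose command and data are derived from the parent job.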