Create SkypilotJobsExecutor to allow running managed jobs #343
…ot API Signed-off-by: Rahim Dharssi <[email protected]>
Signed-off-by: Rahim Dharssi <[email protected]>
0c8ee19 to ab15256
Signed-off-by: Rahim Dharssi <[email protected]>
ab15256 to dd82b84
Signed-off-by: Rahim Dharssi <[email protected]>
61bb39c to 4045098
Thanks for the amazing contribution. It looks like the only failing check is the code coverage check (78.07%, against a minimum of 80%). You can take a look at the missed lines here: https://app.codecov.io/gh/NVIDIA-NeMo/Run/pull/343?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=NVIDIA-NeMo (you can ignore the other failures).
Signed-off-by: Rahim Dharssi <[email protected]>
Signed-off-by: Rahim Dharssi <[email protected]>
f281c7f to 5259c72
@hemildesai Added some unit tests. Thanks!
hemildesai
left a comment
LGTM 🚢
…o#343)
* Create SkypilotJobsExecutor to allow running managed jobs with Skypilot API Signed-off-by: Rahim Dharssi <[email protected]>
* Remove unnecessary comments Signed-off-by: Rahim Dharssi <[email protected]>
* fix lints Signed-off-by: Rahim Dharssi <[email protected]>
* Add comment for suppressing import error Signed-off-by: Rahim Dharssi <[email protected]>
* Write unit tests for _save_job_dir and _get_job_dirs Signed-off-by: Rahim Dharssi <[email protected]>
* Fix lints Signed-off-by: Rahim Dharssi <[email protected]>
---------
Signed-off-by: Rahim Dharssi <[email protected]>
Signed-off-by: Zoey Zhang <[email protected]>
Problem
The SkypilotExecutor cannot launch Skypilot managed jobs, which support features such as automatic retries and recovery from spot preemptions.
Managed jobs use a different SDK than regular jobs. As such, the SkypilotExecutor cannot be used to launch both types of jobs.
Solution
This PR adds a SkypilotJobsExecutor and SkypilotJobsScheduler, which use the jobs SDK to launch managed jobs. The executor works with local and remote Skypilot API servers.
Example usage
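The snippet from the PR description is not reproduced on this page. As a rough illustration only of why a separate executor is needed, the sketch below models the routing described above: each executor is tied to one SDK entry point, so one executor cannot serve both job types. `ExecutorSketch`, `route`, and `fake_launchers` are hypothetical names made up for this sketch, not the nemo_run classes or their constructor arguments. It assumes that regular jobs go through `sky.launch` and managed jobs through `sky.jobs.launch`, and it uses fake launchers so it runs without Skypilot installed.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ExecutorSketch:
    """Hypothetical stand-in for an executor; not the nemo_run class."""

    name: str
    sdk_entry_point: str  # the Skypilot call this executor would route to


# Assumption: regular jobs use sky.launch, managed jobs use sky.jobs.launch.
REGULAR = ExecutorSketch("SkypilotExecutor", "sky.launch")
MANAGED = ExecutorSketch("SkypilotJobsExecutor", "sky.jobs.launch")


def route(
    executor: ExecutorSketch,
    launchers: Dict[str, Callable[[str], str]],
    task: str,
) -> str:
    """Dispatch a task to the launcher matching the executor's SDK entry point."""
    return launchers[executor.sdk_entry_point](task)


# Fake launchers (hypothetical) that only describe what would be launched.
fake_launchers: Dict[str, Callable[[str], str]] = {
    "sky.launch": lambda task: f"cluster job: {task}",
    "sky.jobs.launch": lambda task: f"managed job: {task}",
}

print(route(MANAGED, fake_launchers, "train"))
```

Swapping `MANAGED` for `REGULAR` sends the same task through the other entry point, which is the split the two executors encode.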
Testing Strategy