@rahimftd
Contributor

Problem

The SkypilotExecutor cannot launch SkyPilot managed jobs, which support features such as automatic retries and recovery from spot preemptions.

Managed jobs use a different SDK than regular jobs, so the SkypilotExecutor cannot launch both types of jobs.
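
To illustrate the split described above, here is a minimal sketch of how a caller might route between the two SDK entry points. It assumes the SkyPilot Python SDK exposes `sky.launch` (cluster jobs) and `sky.jobs.launch` (managed jobs); the helper `launch_with_skypilot` and the `fake_sdk` stand-in are hypothetical names used only for illustration and are not the code added in this PR.

```python
from types import SimpleNamespace


def launch_with_skypilot(task, name, managed, sdk=None):
    """Route a task to the cluster SDK or the managed-jobs SDK.

    `sdk` is injectable so the routing can be exercised without SkyPilot
    installed; by default the real `sky` package is imported lazily.
    """
    if sdk is None:
        import sky  # requires SkyPilot to be installed

        sdk = sky
    if managed:
        # Managed jobs go through the separate jobs SDK.
        return sdk.jobs.launch(task, name=name)
    # Regular jobs are launched onto a named cluster.
    return sdk.launch(task, cluster_name=name)


# Hypothetical stand-in SDK that only records which entry point was called.
fake_sdk = SimpleNamespace(
    launch=lambda task, cluster_name: ("cluster", cluster_name),
    jobs=SimpleNamespace(launch=lambda task, name: ("managed", name)),
)

print(launch_with_skypilot("task", "demo", managed=True, sdk=fake_sdk))
print(launch_with_skypilot("task", "demo", managed=False, sdk=fake_sdk))
```

Because the two entry points take different arguments and are handled differently after launch, a dedicated executor for managed jobs avoids branching like this inside the existing one.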

Solution

This PR adds a SkypilotJobsExecutor and a SkypilotJobsScheduler, which use the jobs SDK to launch managed jobs. The executor works with both local and remote SkyPilot API servers.

Example usage

import nemo_run as run

# `recipe` and `experiment_name` are assumed to be defined earlier.
executor = run.SkypilotJobsExecutor(
    gpus="H100",
    launcher="torchrun",
    gpus_per_node=8,
    env_vars={},
    num_nodes=4,
    container_image="nvcr.io/nvidia/nemo:dev",
    infra="kubernetes",
    idle_minutes_to_autostop=10,
    autodown=True,
    packager=run.GitArchivePackager(subpath="nemo_training"),
)
run.run(recipe, executor=executor, name=experiment_name, log_level="DEBUG")

Testing Strategy

  • Tested with remote and local API servers
  • Unit tests

@rahimftd
Contributor Author

@romilbhardwaj

Signed-off-by: Rahim Dharssi <[email protected]>
@hemildesai
Contributor

hemildesai commented Sep 29, 2025

Thanks for the amazing contribution. It looks like the only failing check is the code coverage check (78.07%, against a minimum of 80%). You can take a look at the missed lines here: https://app.codecov.io/gh/NVIDIA-NeMo/Run/pull/343?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=NVIDIA-NeMo

(You can ignore the other failures)

Signed-off-by: Rahim Dharssi <[email protected]>
@rahimftd
Contributor Author

@hemildesai Added some unit tests. Thanks!

Contributor

@hemildesai hemildesai left a comment

LGTM 🚢

@hemildesai hemildesai merged commit 6bc4319 into NVIDIA-NeMo:main Sep 30, 2025
19 of 22 checks passed
zoeyz101 pushed a commit to zoeyz101/NeMo-Run that referenced this pull request Nov 12, 2025
…o#343)

* Create SkypilotJobsExecutor to allow running managed jobs with Skypilot API

Signed-off-by: Rahim Dharssi <[email protected]>

* Remove unnecessary comments

Signed-off-by: Rahim Dharssi <[email protected]>

* fix lints

Signed-off-by: Rahim Dharssi <[email protected]>

* Add comment for suppressing import error

Signed-off-by: Rahim Dharssi <[email protected]>

* Write unit tests for _save_job_dir and _get_job_dirs

Signed-off-by: Rahim Dharssi <[email protected]>

* Fix lints

Signed-off-by: Rahim Dharssi <[email protected]>

---------

Signed-off-by: Rahim Dharssi <[email protected]>
Signed-off-by: Zoey Zhang <[email protected]>