@romilbhardwaj
Contributor

@romilbhardwaj romilbhardwaj commented Sep 12, 2025

This PR adds a configurable retry_until_up parameter to the SkypilotExecutor class, allowing users to control whether SkyPilot should retry launching clusters indefinitely when they fail to come up.

Why is this needed

If a user's clusters are full, they may want to queue their jobs and wait until resources become available. This change lets users keep a launch queued indefinitely by setting retry_until_up=True when initializing the executor. This is particularly useful for long-running experiments that can tolerate delayed starts, and for scenarios where compute resources are scarce and users are willing to wait for availability.
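To make the intended semantics concrete, here is a minimal, self-contained sketch of what "retry until up" means. It does not use NeMo Run or SkyPilot: `ExecutorSketch`, `ResourcesUnavailable`, `flaky_provisioner`, and `launch` are hypothetical stand-ins invented for illustration, and the real retry logic lives inside SkyPilot rather than in a loop like this one.

```python
import time
from dataclasses import dataclass
from typing import Callable


class ResourcesUnavailable(Exception):
    """Hypothetical error: the cluster has no free capacity right now."""


@dataclass
class ExecutorSketch:
    """Stand-in for an executor config (NOT the real SkypilotExecutor)."""

    retry_until_up: bool = False


def flaky_provisioner(failures: int) -> Callable[[], None]:
    """Return a provision function that fails `failures` times, then succeeds."""
    remaining = [failures]

    def provision() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ResourcesUnavailable("cluster is full")

    return provision


def launch(
    executor: ExecutorSketch,
    provision: Callable[[], None],
    wait_seconds: float = 0.0,
) -> int:
    """Try to provision; return the number of attempts it took.

    With retry_until_up=False the first capacity failure is re-raised.
    With retry_until_up=True the loop keeps trying with no upper bound.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            provision()
            return attempts
        except ResourcesUnavailable:
            if not executor.retry_until_up:
                raise
            time.sleep(wait_seconds)


if __name__ == "__main__":
    # Capacity frees up on the third attempt, so the queued launch succeeds.
    print(launch(ExecutorSketch(retry_until_up=True), flaky_provisioner(2)))
```

With the flag left at its default in this sketch, the same call raises on the first failure instead of waiting, which is the trade-off the parameter exposes to the user.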

Discussion

There are open questions about how to handle Ctrl+C and cancelling sky launch requests through nemo run. The current behavior appears to be that nemo experiment cancel only cancels the task and does not sky down the cluster. When coupled with retry_until_up=True, the sky launch request may keep running indefinitely and will need to be manually terminated with sky down or sky api cancel.

We may want to run sky down when nemo experiment cancel is run, but that seems like an issue independent of this PR, so I'll let the Nemo Run maintainers make the call on this.
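The concern can be illustrated with a small sketch of an unbounded retry loop that only stops when a cancel signal arrives, and that runs a teardown step on the way out. This is an assumption-laden illustration, not how NeMo Run or SkyPilot handle cancellation: `retry_until_cancelled`, `provision`, and `teardown` are hypothetical names, with `teardown` standing in for whatever would eventually call sky down.

```python
import threading
from typing import Callable


def retry_until_cancelled(
    provision: Callable[[], bool],
    cancel_event: threading.Event,
    teardown: Callable[[], None],
    wait_seconds: float = 0.0,
) -> str:
    """Retry `provision` until it returns True or a cancel is requested.

    Without the cancel check, a launch that never finds capacity would
    loop forever, which is the situation described in the discussion.
    """
    while not cancel_event.is_set():
        if provision():
            return "up"
        # wait() returns early if the event is set while sleeping.
        cancel_event.wait(wait_seconds)
    # Hypothetical cleanup hook, standing in for a `sky down` call.
    teardown()
    return "cancelled"


if __name__ == "__main__":
    cancel = threading.Event()
    attempts = []
    cleaned_up = []

    def never_up() -> bool:
        attempts.append(1)
        if len(attempts) == 3:
            cancel.set()  # simulate the user cancelling after three tries
        return False

    print(retry_until_cancelled(never_up, cancel, lambda: cleaned_up.append("down")))
    print(len(attempts), cleaned_up)
```

The point of the sketch is only that an indefinite retry needs an explicit cancellation path with cleanup attached, otherwise the request outlives the task that started it.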

Signed-off-by: Romil Bhardwaj <[email protected]>
@hemildesai
Contributor

Thanks @romilbhardwaj, created #341 to call sky down when cancelling.

@hemildesai hemildesai merged commit c16a572 into NVIDIA-NeMo:main Sep 12, 2025
20 of 22 checks passed
zoeyz101 pushed a commit to zoeyz101/NeMo-Run that referenced this pull request Nov 12, 2025
Signed-off-by: Romil Bhardwaj <[email protected]>
Signed-off-by: Zoey Zhang <[email protected]>