Open
Description
The only runner API we have for stopping a job is cancel, which is implemented with delete_namespaced_custom_object and therefore removes all traces of the job from the cluster.
Alternatively, we want to abort a job, e.g. by setting vcjob["status"]["state"]["phase"] = "Aborted" and calling replace_namespaced_custom_object_status, so that the pod spec remains available to (1) inspect its contents and (2) clone it via the apply command.
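A minimal sketch of the abort path described above, for illustration only. It assumes `api` behaves like `kubernetes.client.CustomObjectsApi`, and that the job is a Volcano job addressed by group `batch.volcano.sh`, version `v1alpha1`, plural `jobs`. The helper name `abort_vcjob` is hypothetical and not part of torchx. Whether the Volcano controller leaves a directly written phase in place has not been verified here.

```python
from typing import Any, Dict

# Assumed coordinates of the Volcano Job custom resource ("vcjob").
VOLCANO_GROUP = "batch.volcano.sh"
VOLCANO_VERSION = "v1alpha1"
VOLCANO_PLURAL = "jobs"


def abort_vcjob(api: Any, namespace: str, name: str) -> Dict[str, Any]:
    """Mark a Volcano job as Aborted instead of deleting it.

    Hypothetical helper: `api` is expected to expose the same methods as
    kubernetes.client.CustomObjectsApi.
    """
    # Fetch the current object so the status update carries a fresh
    # resourceVersion and the unchanged spec.
    vcjob = api.get_namespaced_custom_object_status(
        group=VOLCANO_GROUP,
        version=VOLCANO_VERSION,
        namespace=namespace,
        plural=VOLCANO_PLURAL,
        name=name,
    )

    # Set the phase as proposed in this issue; create the nested dicts
    # if the job has no status yet.
    vcjob.setdefault("status", {}).setdefault("state", {})["phase"] = "Aborted"

    # Write only the status subresource; the job object and its pod spec
    # stay in the cluster.
    return api.replace_namespaced_custom_object_status(
        group=VOLCANO_GROUP,
        version=VOLCANO_VERSION,
        namespace=namespace,
        plural=VOLCANO_PLURAL,
        name=name,
        body=vcjob,
    )
```

With a configured cluster, this could be called as `abort_vcjob(kubernetes.client.CustomObjectsApi(), "default", "my-job")` after loading the kubeconfig. The namespace and job name here are placeholders.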
Motivation/Background
Users are forced to implement this manually and to avoid the torchx CLI for the k8s scheduler.
Detailed Proposal
Either change the behavior of cancel (abort seems closer to cancel semantics than delete), or add a new runner API.
Alternatives
Don't use the torchx CLI for k8s; use a custom script to abort the job instead.
Additional context/links