Abort k8s job using CLI #1156

@clumsy

Description

Currently the only option is the runner's cancel API, which is implemented with delete_namespaced_custom_object and therefore removes every trace of the job from the cluster.

Alternatively, we would like to abort a job, e.g. by setting vcjob["status"]["state"]["phase"] = "Aborted" and calling replace_namespaced_custom_object_status, so that the pod spec remains available to (1) inspect its contents and (2) clone it via the apply command.
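For illustration, a minimal sketch of the abort approach described above. The helper name abort_vcjob is hypothetical, and the Volcano group/version/plural values are assumptions about the CRD the k8s scheduler targets. The api argument is expected to be a kubernetes.client.CustomObjectsApi instance. Whether the Volcano controller keeps a phase written directly to the status subresource has not been verified here.

```python
from typing import Any, Dict

# Assumed Volcano CRD coordinates; adjust if the cluster uses different ones.
VOLCANO_GROUP = "batch.volcano.sh"
VOLCANO_VERSION = "v1alpha1"
VOLCANO_PLURAL = "jobs"


def abort_vcjob(api: Any, namespace: str, name: str) -> Dict[str, Any]:
    """Mark a Volcano job as Aborted instead of deleting it (hypothetical helper).

    `api` is expected to behave like kubernetes.client.CustomObjectsApi.
    """
    # Read the current object, including its status subresource.
    vcjob = api.get_namespaced_custom_object_status(
        group=VOLCANO_GROUP,
        version=VOLCANO_VERSION,
        namespace=namespace,
        plural=VOLCANO_PLURAL,
        name=name,
    )
    # Only the phase is changed; the spec stays intact for inspection or cloning.
    vcjob.setdefault("status", {}).setdefault("state", {})["phase"] = "Aborted"
    # Write the modified object back through the status subresource.
    return api.replace_namespaced_custom_object_status(
        group=VOLCANO_GROUP,
        version=VOLCANO_VERSION,
        namespace=namespace,
        plural=VOLCANO_PLURAL,
        name=name,
        body=vcjob,
    )


# Example usage against a real cluster (requires the `kubernetes` package):
#   from kubernetes import client, config
#   config.load_kube_config()
#   abort_vcjob(client.CustomObjectsApi(), "default", "my-job")
```

Unlike the delete-based cancel, this leaves the job object in the cluster, so the spec can still be read back or re-applied.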

Motivation/Background

Users are forced to implement this manually and to avoid the torchx CLI for the k8s scheduler.

Detailed Proposal

Either change the behavior of cancel (abort seems closer to cancel semantics than delete), or add a new runner API.

Alternatives

Don't use the torchx CLI for k8s; use a custom script to abort the job instead.

Additional context/links

https://kubernetes-asyncio.readthedocs.io/en/latest/kubernetes_asyncio.client.api.custom_objects_api.html#kubernetes_asyncio.client.api.custom_objects_api.CustomObjectsApi.replace_namespaced_custom_object_status
