
LWS job is deleted if service is enabled #1358

@Edwinhr716

Description


When we build the Service object for deployment, we first fetch the LWS object here: https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/k8s_service.py#L163. This fetch happens too soon after the LWS create request is sent, so it returns a 404 error. The failure forces bastion to retry creating the LWS object. This time, however, the create request returns a 409 error because the object already exists. This triggers a failure loop that continues until bastion gives up and deletes the job.

The fix here is to catch the 404 error and only continue once the fetch returns a 200 response.
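
A minimal sketch of that retry, not the actual axlearn change. `ApiError` and `wait_until_found` are hypothetical names used for illustration; with the official Kubernetes Python client the error would presumably be `kubernetes.client.exceptions.ApiException`, which also exposes a `status` attribute. The timeout and polling interval are assumed values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes client error with an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def wait_until_found(
    fetch: Callable[[], T],
    *,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Calls fetch() until it stops raising a 404, then returns its result."""
    deadline = clock() + timeout_s
    while True:
        try:
            return fetch()
        except ApiError as e:
            # Only a 404 means "not visible yet"; any other status is a real failure.
            if e.status != 404:
                raise
            if clock() >= deadline:
                raise TimeoutError("object still not found after timeout") from e
        # Wait before polling again so we do not hammer the API server.
        sleep(interval_s)
```

The Service builder would wrap its LWS lookup in `wait_until_found(...)` instead of calling it once, so a 404 right after creation no longer propagates to bastion and triggers the create retry that ends in a 409.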
