Open
Description
When we build the service object to deploy it, we first attempt to fetch the LWS object here: https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/k8s_service.py#L163. This fetch happens too soon after the LWS create request is sent, so it returns a 404 error. The 404 forces bastion to retry creating the LWS object. This time, however, the create request returns a 409 error because the object already exists, which triggers a failure loop that continues until bastion gives up and deletes the job.
The fix here is to catch the 404 error and only continue once the fetch returns a 200 response.
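A minimal sketch of that polling approach, not the actual axlearn change. `ApiError` and `wait_until_found` are hypothetical names used for illustration; in real code the error would presumably be whatever the Kubernetes client raises (for example the Python client's `ApiException`, which carries a `status` attribute), and the timeout and interval values are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes client error with an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def wait_until_found(
    fetch: Callable[[], T],
    *,
    timeout_s: float = 30.0,   # assumed value, tune for the cluster
    interval_s: float = 1.0,   # assumed value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `fetch` until it stops raising a 404, then return its result.

    A 404 is treated as "not visible yet" and retried until the deadline.
    Any other error status is re-raised immediately.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return fetch()
        except ApiError as e:
            if e.status != 404:
                raise  # Not a "not found yet" condition; do not mask it.
            if clock() >= deadline:
                raise TimeoutError(
                    f"object still not found after {timeout_s}s"
                ) from e
        sleep(interval_s)
```

With a helper like this, the caller would wrap the LWS fetch (for example `wait_until_found(lambda: get_lws(name))`, where `get_lws` is a hypothetical fetch function) instead of letting the first 404 propagate to bastion and trigger a second create.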