Open
Labels
- P2: Priority 2 - Improvement with narrower impact, fix within a month
- bug: Someone made a missteak...
- curator request: Improvements requested by curators
- dp: Data Platform workstream
Description
Describe the bug
A curator updated the DOI on a private collection, which started the metadata update for its datasets. A batch job performs this update. A few datasets got stuck on initialize, and the batch job showed this error:
error: DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s
The error comes from AWS timing out while creating the Docker container. This is not something we have direct control over. However, retry logic should be able to restart the job without any intervention. There is retry logic in the terraform for this batch job, but the failing batch job shows no retry logic applied.
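As a rough illustration of the retry behavior wanted here, the sketch below builds a retry strategy in the shape AWS Batch accepts (`attempts` plus `evaluateOnExit` rules) and approximates locally how the rules would be evaluated. The attempt count, the `decide` helper, and the assumption that the `DockerTimeoutError` text appears in the job's container reason (matched by `onReason`) rather than in its status reason are all illustrative and should be checked against the failing job.

```python
from fnmatch import fnmatchcase

# Hypothetical retry strategy, in the shape of AWS Batch's `retryStrategy`.
# Assumption: the DockerTimeoutError text is reported as the container reason.
RETRY_STRATEGY = {
    "attempts": 3,  # illustrative value, not taken from the terraform
    "evaluateOnExit": [
        # Retry the transient container-creation timeout.
        {"onReason": "DockerTimeoutError*", "action": "RETRY"},
        # Anything else exits immediately instead of being retried.
        {"onReason": "*", "action": "EXIT"},
    ],
}


def decide(reason: str, strategy: dict = RETRY_STRATEGY) -> str:
    """Local approximation of rule evaluation: the first matching rule wins."""
    for rule in strategy["evaluateOnExit"]:
        if fnmatchcase(reason, rule.get("onReason", "*")):
            return rule["action"]
    # When no rule matches, AWS Batch retries the job.
    return "RETRY"


error = "DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s"
print(decide(error))
print(decide("OutOfMemoryError: Container killed due to memory usage"))
```

The same dictionary could be passed as the `retryStrategy` argument when registering the job definition or submitting the job with boto3, or mirrored in the terraform job definition.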

Workaround
If this is encountered, the batch job can be restarted manually by cloning the job for each stuck dataset.
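The manual cloning step could be scripted along these lines. This is a minimal sketch, assuming the stuck job's details were fetched with the Batch `describe_jobs` call; the `clone_kwargs` helper and the queue and definition names in the example are hypothetical.

```python
def clone_kwargs(job: dict) -> dict:
    """Build submit_job keyword arguments from one describe_jobs job entry."""
    kwargs = {
        "jobName": job["jobName"],
        "jobQueue": job["jobQueue"],
        "jobDefinition": job["jobDefinition"],
    }
    if job.get("parameters"):
        kwargs["parameters"] = job["parameters"]

    container = job.get("container", {})
    overrides = {}
    if container.get("command"):
        overrides["command"] = container["command"]
    # Skip variables with the reserved AWS_BATCH prefix, which cannot be set by users.
    env = [
        item for item in container.get("environment", [])
        if not item["name"].startswith("AWS_BATCH")
    ]
    if env:
        overrides["environment"] = env
    if overrides:
        kwargs["containerOverrides"] = overrides
    return kwargs


# Hypothetical job description for one stuck dataset.
stuck_job = {
    "jobName": "dataset_metadata_update",
    "jobQueue": "example-queue",
    "jobDefinition": "example-definition:1",
    "container": {
        "command": ["python", "update.py"],
        "environment": [{"name": "DATASET_ID", "value": "abc"}],
    },
}
print(clone_kwargs(stuck_job))
```

With credentials available, the result could be submitted with `boto3.client("batch").submit_job(**clone_kwargs(stuck_job))`, once per stuck dataset.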
Expected behavior
- The job is retried on the above error.
- A nice-to-have would be a program that checks for stuck or failed dataset_metadata_update batch jobs and sets the status of the affected datasets to an appropriate error message.
- There are no CloudWatch error messages because this was a transient AWS error, so alerts and metrics will not tell us it failed.
- A step function could be used, but is likely overkill.
- Adjusting the retry logic in the batch job is the best solution, as long as it can catch these transient AWS errors.
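The nice-to-have checker from the list above could start from something like the following sketch. It assumes job summaries in the shape returned by the Batch `list_jobs` call (`jobId`, `status`, `statusReason`, `createdAt` in epoch milliseconds); the `find_problem_jobs` helper and the one-hour threshold are hypothetical, and writing the error status back to the dataset is left out because that part is specific to the application.

```python
from datetime import datetime, timedelta, timezone

# Batch states that precede RUNNING; a job lingering here is treated as stuck.
PRE_RUN_STATES = {"SUBMITTED", "PENDING", "RUNNABLE", "STARTING"}


def find_problem_jobs(summaries, now, max_age=timedelta(hours=1)):
    """Return (job_id, problem) pairs for failed jobs and jobs stuck before RUNNING."""
    problems = []
    for job in summaries:
        status = job["status"]
        if status == "FAILED":
            problems.append((job["jobId"], job.get("statusReason", "failed")))
        elif status in PRE_RUN_STATES:
            # createdAt is a Unix timestamp in milliseconds.
            created = datetime.fromtimestamp(job["createdAt"] / 1000, tz=timezone.utc)
            if now - created > max_age:
                problems.append((job["jobId"], f"stuck in {status}"))
    return problems


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
two_hours_ago = int((now - timedelta(hours=2)).timestamp() * 1000)
jobs = [
    {"jobId": "job-1", "status": "FAILED", "statusReason": "Task failed to start",
     "createdAt": two_hours_ago},
    {"jobId": "job-2", "status": "STARTING", "createdAt": two_hours_ago},
    {"jobId": "job-3", "status": "SUCCEEDED", "createdAt": two_hours_ago},
]
print(find_problem_jobs(jobs, now))
```

Run on a schedule, a check like this would surface the failures that CloudWatch alerts miss, since it looks at job state rather than application logs.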
Environment
first discovered in 46431eb
Additional Context