
bug(collection DOI update): dataset stuck on initialize after updating collection #7396

@Bento007


Describe the bug

A curator updated the DOI on a private collection, which started the metadata update for its datasets. A batch job performs this update. A few datasets got stuck on initialize. The batch job showed this error:

error: DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s

The error is related to ECR failing to create the Docker container. This is not something we have direct control over. However, retry logic should be able to restart the job without any intervention. There is retry logic in the Terraform for this batch job, but the failing batch job itself shows no retry logic.
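For reference, a retry strategy that targets this kind of failure could look roughly like the sketch below. It builds the `retryStrategy` structure that AWS Batch job definitions accept (`attempts` plus `evaluateOnExit` rules). The assumption here, not yet verified against the failing job, is that the container-level reason starts with `DockerTimeoutError`, as in the error above. The function name and the default of 3 attempts are illustrative only.

```python
import json

# Assumption: the failed attempt reports a container reason that starts with
# "DockerTimeoutError", as in the error quoted in this issue.
TRANSIENT_REASON_PATTERNS = ("DockerTimeoutError*",)


def build_retry_strategy(attempts=3, reason_patterns=TRANSIENT_REASON_PATTERNS):
    """Build an AWS Batch retryStrategy dict that retries only transient errors."""
    if not 1 <= attempts <= 10:
        # AWS Batch accepts between 1 and 10 attempts.
        raise ValueError("attempts must be between 1 and 10")
    rules = [{"onReason": pattern, "action": "RETRY"} for pattern in reason_patterns]
    # Catch-all: every other failure exits without a retry.
    rules.append({"onReason": "*", "action": "EXIT"})
    return {"attempts": attempts, "evaluateOnExit": rules}


if __name__ == "__main__":
    # The same structure maps onto the retry_strategy block in Terraform.
    print(json.dumps(build_retry_strategy(), indent=2))
```

The catch-all `EXIT` rule is a design choice: dropping it would make Batch retry every failure, including genuine application errors.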

Workaround

If this is encountered, the batch job can be restarted manually by cloning the job for each stuck dataset.
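If many datasets are stuck, the manual cloning could be scripted. The sketch below is only an approximation of what cloning does in the console: it copies the queue, job definition, parameters, and the container command and environment from a `describe_jobs` entry into a `submit_job` request. `clone_job_request`, `resubmit`, and the `-retry` suffix are hypothetical names, and `batch_client` is assumed to be a boto3 Batch client.

```python
def clone_job_request(job, name_suffix="-retry"):
    """Build submit_job keyword arguments from one describe_jobs entry."""
    request = {
        "jobName": job["jobName"] + name_suffix,  # hypothetical naming scheme
        "jobQueue": job["jobQueue"],
        "jobDefinition": job["jobDefinition"],
    }
    if job.get("parameters"):
        request["parameters"] = job["parameters"]
    # Carry over the command and environment so the clone targets the same dataset.
    container = job.get("container") or {}
    overrides = {
        key: container[key] for key in ("command", "environment") if container.get(key)
    }
    if overrides:
        request["containerOverrides"] = overrides
    return request


def resubmit(batch_client, job_ids):
    """Resubmit each job in job_ids and return the new job IDs.

    batch_client is assumed to be boto3.client("batch"); describe_jobs accepts
    at most 100 job IDs per call.
    """
    jobs = batch_client.describe_jobs(jobs=list(job_ids))["jobs"]
    return [batch_client.submit_job(**clone_job_request(job))["jobId"] for job in jobs]
```

Usage would be along the lines of `resubmit(boto3.client("batch"), ["<job-id>"])`, one job ID per stuck dataset.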

Expected behavior

  • The job is retried on the above error.
  • A nice-to-have would be a program that checks for stuck or failed dataset_metadata_update batch jobs and sets the status of the affected datasets to an appropriate error message.
    • There are no CloudWatch error messages since this was a transient AWS error, so alerts and metrics will not tell us it failed.
    • A step function could be used, but is likely overkill
    • Adjusting the retry logic in the batch job is the best solution as long as it can catch these transient AWS errors.
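A minimal sketch of the nice-to-have checker is below. It takes job entries in the shape returned by the Batch `describe_jobs` API and reports failed dataset_metadata_update jobs with their failure reason. Two assumptions: the job names start with `dataset_metadata_update`, and a transient failure is recognizable by a reason starting with `DockerTimeoutError`. Mapping a job back to its dataset and writing the dataset status are left out, since that depends on how the job is invoked.

```python
def failure_reason(job):
    """Return the most specific failure text available for a describe_jobs entry."""
    attempts = job.get("attempts") or []
    if attempts:
        last = attempts[-1]
        # Prefer the container-level reason of the last attempt.
        reason = (last.get("container") or {}).get("reason") or last.get("statusReason")
        if reason:
            return reason
    return job.get("statusReason", "")


def find_failed_updates(jobs, name_prefix="dataset_metadata_update"):
    """List failed metadata-update jobs with their reason.

    name_prefix is an assumption about how the batch jobs are named.
    """
    failed = []
    for job in jobs:
        if job.get("status") != "FAILED":
            continue
        if not job.get("jobName", "").startswith(name_prefix):
            continue
        reason = failure_reason(job)
        failed.append(
            {
                "jobId": job["jobId"],
                "reason": reason,
                # Assumption: transient AWS errors surface as DockerTimeoutError.
                "transient": reason.startswith("DockerTimeoutError"),
            }
        )
    return failed
```

Jobs flagged as `transient` could be resubmitted automatically, while the rest would have their dataset status set to the reported reason.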

Environment

First discovered in 46431eb

Additional Context

Labels

  • P2: Priority 2 - Improvement with narrower impact, fix within a month
  • bug: Someone made a missteak...
  • curator request: Improvements requested by curators
  • dp: Data Platform workstream
