Automatic retry with CannotPullContainerError #5099

@drpatelh

New feature

AWS Batch, by default, will attempt to schedule multiple tasks on the same VM. This is great from a cost/time perspective but can have some unwanted side effects.

We consistently see CannotPullContainerError failures, which we believe result from exhausting the VM's boot disk (default: 30 GB) when Docker pulls the container images for all of the tasks scheduled on it.

These are the errors from our last 5 failed pipeline runs, which were submitted on subsequent days:

Caused by:
  Task failed to start - CannotPullContainerError: write /var/lib/docker/tmp/GetImageBlob887491687: no space left on device

Caused by:
  Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/x86_64-conda-linux-gnu/sysroot/usr/lib64/locale/locale-archive.tmpl: no space left on device

Caused by:
  Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/bin/ripples-fast: no space left on device

Caused by:
  Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/share/doc/gettext/examples/hello-c++-kde/admin/am_edit: no space left on device

Caused by:
  Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/lib/libopenblasp-r0.3.10.so: no space left on device

An example of a failed pipeline can be found in the community/showcase Workspace on Seqera Platform:
https://cloud.seqera.io/orgs/community/workspaces/showcase/watch/3EdFagC3x2Q6yK

I have uploaded the .nextflow.log file here for convenience:
nf-3EdFagC3x2Q6yK.log

Usage scenario

When running Nextflow pipelines on AWS Batch, multiple tasks are packed onto the same instance by default.

Suggest implementation

I believe the current Nextflow retry strategy will not catch this particular error, because the task fails to start and so never generates an exit code.

It would be great to automatically retry a task submission when a CannotPullContainerError is detected. If this error is too generic to catch, we could also match more specifically on the "no space left on device" message.
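To make the proposal concrete, here is a minimal sketch of the matching and retry logic in Python. It is only an illustration of the idea, not how Nextflow or AWS Batch is implemented: the function names, the `require_no_space` flag, the `max_attempts` default, and the `submit` callable are all hypothetical.

```python
# Hypothetical helper names; this is not Nextflow's or AWS Batch's API.
PULL_ERROR = "CannotPullContainerError"
NO_SPACE = "no space left on device"


def is_retryable_start_failure(reason: str, require_no_space: bool = False) -> bool:
    """Return True if a failure reason looks like a container pull failure."""
    if PULL_ERROR not in reason:
        return False
    # Narrower match, in case CannotPullContainerError alone is too generic.
    return NO_SPACE in reason if require_no_space else True


def submit_with_retry(submit, max_attempts: int = 3) -> int:
    """Call submit() until it succeeds or fails for a non-retryable reason.

    `submit` is a stand-in callable that returns None on success, or a
    failure-reason string when the task failed to start.
    Returns the number of attempts that were needed.
    """
    reason = None
    for attempt in range(1, max_attempts + 1):
        reason = submit()
        if reason is None:
            return attempt
        if not is_retryable_start_failure(reason):
            break  # unrelated failure: do not retry
    raise RuntimeError(f"task failed to start: {reason}")
```

The check works on the failure reason string rather than an exit code, since the task never starts and so has no exit status to inspect. Setting `require_no_space=True` corresponds to the narrower "no space left on device" match.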
