Description
New feature
AWS Batch, by default, will attempt to schedule multiple tasks on the same VM. This is great from a cost/time perspective but can have some unwanted side effects.
We consistently see CannotPullContainerError failures, which we believe result from exhausting the boot disk on the VM (default: 30 GB) when the Docker images for all tasks scheduled on that VM are pulled.
The last 5 pipeline errors we saw, from runs submitted on subsequent days:
Caused by:
Task failed to start - CannotPullContainerError: write /var/lib/docker/tmp/GetImageBlob887491687: no space left on device
Caused by:
Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/x86_64-conda-linux-gnu/sysroot/usr/lib64/locale/locale-archive.tmpl: no space left on device
Caused by:
Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/bin/ripples-fast: no space left on device
Caused by:
Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/share/doc/gettext/examples/hello-c++-kde/admin/am_edit: no space left on device
Caused by:
Task failed to start - CannotPullContainerError: failed to register layer: write /usr/local/lib/libopenblasp-r0.3.10.so: no space left on device
An example of a failed pipeline can be found in the community/showcase Workspace on Seqera Platform:
https://cloud.seqera.io/orgs/community/workspaces/showcase/watch/3EdFagC3x2Q6yK
I have uploaded the .nextflow.log file here for convenience:
nf-3EdFagC3x2Q6yK.log
Usage scenario
When NF pipelines run on AWS Batch, multiple tasks are packed onto the same instance by default.
Suggest implementation
I believe the current NF retry strategy will not catch this particular error, because the task fails to start and therefore never generates an exit code.
It would be great to automatically retry the task submission when a CannotPullContainerError is detected. If this error is too generic to catch, we could also match more specifically on "no space left on device".
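To make the matching rule concrete, here is a minimal sketch in Python. It is not Nextflow code and does not reflect how Nextflow or AWS Batch expose the failure reason; the function name `is_retryable_start_failure` and the `require_disk_full` flag are hypothetical and only illustrate the broad match versus the narrower one.

```python
# Marker strings copied from the error messages quoted in this issue.
PULL_ERROR = "CannotPullContainerError"
DISK_FULL = "no space left on device"


def is_retryable_start_failure(reason: str, require_disk_full: bool = False) -> bool:
    """Decide whether a task's failure reason should trigger a resubmission.

    Hypothetical helper: `reason` is assumed to be the status text reported
    for a task that failed to start.
    """
    if PULL_ERROR not in reason:
        return False
    if require_disk_full:
        # Narrower rule: only retry when the pull failed for lack of disk space.
        return DISK_FULL in reason.lower()
    # Broad rule: retry on any CannotPullContainerError.
    return True


reason = (
    "Task failed to start - CannotPullContainerError: failed to register layer: "
    "write /usr/local/bin/ripples-fast: no space left on device"
)
print(is_retryable_start_failure(reason))
print(is_retryable_start_failure(reason, require_disk_full=True))
```

All five errors listed above would match under either rule. The narrower rule would skip pull failures that have other causes, where a retry is unlikely to help.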