Replies: 1 comment
-
I believe this is related to job submission/container management overhead. There is a Nextflow pattern that helps reduce this overhead (check it here). If I recall correctly, every task in AWS Batch is encapsulated in its own container: if 10 files are handled by process A, that means 10 containers just for process A, and with 100,000 files it means 100,000 containers.
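As a rough illustration of the batching idea (a minimal sketch, not the linked pattern verbatim; PROCESS_A, the bucket path, and the batch size of 100 are placeholders), grouping files so that each AWS Batch job handles many of them cuts the number of containers dramatically:

```nextflow
nextflow.enable.dsl = 2

// Hypothetical stand-in for "process A": each task receives a whole batch of
// files, so one container does the work that would otherwise be spread over
// hundreds of containers.
process PROCESS_A {
    input:
    path batch            // a list of files staged together into one task

    output:
    path 'batch_result.txt'

    script:
    """
    cat ${batch} > batch_result.txt
    """
}

workflow {
    Channel
        .fromPath('s3://my-bucket/inputs/*.txt')   // hypothetical input location
        .buffer(size: 100, remainder: true)        // emit lists of up to 100 files
        | PROCESS_A
}
```

With a few thousand inputs and a batch size of 100, that is a few dozen Batch jobs instead of thousands, so the per-job scheduling and container start-up cost is paid far less often.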
-
I'm running a pipeline with AWS Batch where one process takes a few thousand small (less than 1 MB) files as input. I noticed that the pipeline becomes much slower at this step as the number of files increases, even when the total combined size of all files stays the same: several thousand files can take a few hours, while a few dozen take only minutes. According to the Nextflow run report the process itself is fast either way (a few minutes), and I confirmed that the operators feeding it (a combination of collect() and groupTuple()) are fast as well. My hypothesis is that staging lots of small files into the Docker container is what's slow.
Is this a known issue? And are there any workarounds? Any suggestions would be much appreciated!
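For context, the relevant part of the workflow is wired roughly like this (a minimal sketch; MERGE_RESULTS, the bucket path, and the script body are placeholders for the real process, which also involves groupTuple() upstream):

```nextflow
nextflow.enable.dsl = 2

// Placeholder for the real process: the command itself finishes in minutes,
// but thousands of small input files must be staged into the container first.
process MERGE_RESULTS {
    input:
    path small_files       // a few thousand files, each under 1 MB

    output:
    path 'merged.txt'

    script:
    """
    cat ${small_files} > merged.txt
    """
}

workflow {
    Channel
        .fromPath('s3://my-bucket/fragments/*.txt')   // hypothetical input location
        .collect()                                    // one list containing every file
        | MERGE_RESULTS
}
```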