Replies: 1 comment
-
I believe this is related to job submission/container management overhead. There is a Nextflow pattern that helps reduce this overhead (check it here). If I recall correctly, every task in AWS Batch is encapsulated in its own container: if 10 files are handled by process A, that means 10 containers just for process A, and with 100,000 files it means 100,000 containers.
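As a rough illustration of the batching idea (a minimal sketch, not the linked pattern verbatim; PROCESS_A, the bucket path, and the batch size of 100 are placeholders), grouping files so that each AWS Batch job handles many of them cuts the number of containers dramatically:

```nextflow
nextflow.enable.dsl = 2

// Hypothetical stand-in for "process A": each task receives a whole batch of
// files, so one container does the work that would otherwise be spread over
// hundreds of containers.
process PROCESS_A {
    input:
    path batch            // a list of files staged together into one task

    output:
    path 'batch_result.txt'

    script:
    """
    cat ${batch} > batch_result.txt
    """
}

workflow {
    Channel
        .fromPath('s3://my-bucket/inputs/*.txt')   // hypothetical input location
        .buffer(size: 100, remainder: true)        // emit lists of up to 100 files
        | PROCESS_A
}
```

With a few thousand inputs and a batch size of 100, that is a few dozen Batch jobs instead of thousands, so the per-job scheduling and container start-up cost is paid far less often.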
-
I'm running a pipeline with AWS Batch where one process takes a few thousand small (less than 1 MB) files as input. I noticed that the pipeline becomes much slower at this step as the number of files increases, even when the total combined size of all files stays the same: several thousand files can take a few hours, while a few dozen take only minutes. According to the Nextflow run report the process itself is fast either way (a few minutes), and I confirmed that the operators feeding it (a combination of collect() and groupTuple()) are fast as well. My hypothesis is that staging lots of small files into the Docker container is what's slow.
Is this a known issue? And are there any workarounds? Any suggestions would be much appreciated!
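For context, the relevant part of the workflow is wired roughly like this (a minimal sketch; MERGE_RESULTS, the bucket path, and the script body are placeholders for the real process, which also involves groupTuple() upstream):

```nextflow
nextflow.enable.dsl = 2

// Placeholder for the real process: the command itself finishes in minutes,
// but thousands of small input files must be staged into the container first.
process MERGE_RESULTS {
    input:
    path small_files       // a few thousand files, each under 1 MB

    output:
    path 'merged.txt'

    script:
    """
    cat ${small_files} > merged.txt
    """
}

workflow {
    Channel
        .fromPath('s3://my-bucket/fragments/*.txt')   // hypothetical input location
        .collect()                                    // one list containing every file
        | MERGE_RESULTS
}
```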