AWS Fusion integration is less efficient than NO Fusion when concatenating a large number of files #3844

wikiselev · 2023-04-06T10:27:22Z

wikiselev
Apr 6, 2023

Bug report

AWS Fusion integration is less efficient than NO Fusion when concatenating a large number of files (~20,000).

Expected behavior and actual behavior

Using Fusion is supposed to make file operations faster, but I observe the opposite.

Steps to reproduce the problem

My pipeline runs many instances of the same process, each of which generates one CSV file. Each file contains a single row with 10 columns (200-300 bytes in size). The files are then concatenated by a single process with the following script:

find . -name "*.csv" -exec cat {} \; > combined_raw.txt

Program output

I've tested the pipeline with different numbers of CSV files, with and without Fusion, and got the following results:

| Number of CSV files | With Fusion | Without Fusion |
| --- | --- | --- |
| 32 | OK (12GB RAM, 1 min) | Didn't run |
| 1,000 | OK (12GB RAM, 36 min) | Didn't run |
| 10,000 | OK (24GB RAM, 5.5 hours) | Didn't run |
| 20,000 | Failed with both 12GB and 24GB of RAM (`maxRetries = 5`), cancelled after waiting too long with 36GB of RAM | OK (24GB RAM, 3 hours) |

Environment

Tower + AWS Batch Spot instances

pditommaso · 2023-04-06T10:42:38Z

pditommaso
Apr 6, 2023
Maintainer

Look @jordeu new challenge for easter time! 😆

@wikiselev shouldn't `cat *.csv > combined_raw.txt` do the same?

0 replies

jordeu · 2023-04-06T10:47:41Z

jordeu
Apr 6, 2023
Collaborator

How did you set up the Compute Environment? Are you using Tower? With "fast instance storage" enabled?

0 replies

wikiselev · 2023-04-06T11:03:21Z

wikiselev
Apr 6, 2023
Author

> Look @jordeu new challenge for easter time! 😆
>
> @wikiselev shouldn't `cat *.csv > combined_raw.txt` do the same?

Yes, that was my original code, but then I found some discussions about `cat` failing when given too many arguments, so I decided to use `find` to be on the safe side.
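For readers comparing the two forms, here is a minimal sketch of three ways to concatenate many small files without passing them all on one command line. The `demo_csv` directory, the file names and the count of 500 files are made up for illustration. The `\;` form starts one `cat` process per file, while `{} +` and `xargs` pack many file names into each `cat` invocation while staying under the argument-length limit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo data: 500 single-row CSV files in a scratch directory.
rm -rf demo_csv
mkdir -p demo_csv
for i in $(seq 1 500); do
  printf 'sample_%s,1,2,3\n' "$i" > "demo_csv/file_$i.csv"
done

# Variant 1: one cat process per file (the form used in the report above).
find demo_csv -name "*.csv" -exec cat {} \; > demo_csv/combined_one_by_one.txt

# Variant 2: find packs as many file names as fit into each cat invocation.
find demo_csv -name "*.csv" -exec cat {} + > demo_csv/combined_batched.txt

# Variant 3: null-delimited names piped to xargs, which also batches.
find demo_csv -name "*.csv" -print0 | xargs -0 cat > demo_csv/combined_xargs.txt

# All three outputs should contain the same number of rows.
wc -l demo_csv/combined_*.txt
```

The output files use a `.txt` extension so that they are not picked up by the `*.csv` pattern while they are being written.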

0 replies

wikiselev · 2023-04-06T11:04:31Z

wikiselev
Apr 6, 2023
Author

> How did you set up the Compute Environment? Are you using Tower? With "fast instance storage" enabled?

Yes, this is exactly the setup that I call With Fusion.

0 replies

jordeu · 2023-04-06T11:09:43Z

jordeu
Apr 6, 2023
Collaborator

Are these 20,000 files generated by 20,000 different processes and then declared as input of the "concatenate" process?

0 replies

wikiselev · 2023-04-06T11:15:54Z

wikiselev
Apr 6, 2023
Author

> Are these 20,000 files generated by 20,000 different processes and then declared as input of the "concatenate" process?

They are generated by 20,000 processes. In the workflow I pass `generate_files.out.collect()` to the process that concatenates the files.

0 replies

wikiselev · 2023-04-06T11:21:18Z

wikiselev
Apr 6, 2023
Author

Actually, after answering @pditommaso I realised that the long failed run was using just `cat` without `find`. Though I think (hope!) it does not make a difference... So the table describing the scripts used in the concatenation process looks like this:

| Number of CSV files | With Fusion | Without Fusion |
| --- | --- | --- |
| 32 | `find` + `cat` | Didn't run |
| 1,000 | `find` + `cat` | Didn't run |
| 10,000 | `find` + `cat` | Didn't run |
| 20,000 | `cat` | `find` + `cat` |

0 replies

jordeu · 2023-04-06T11:44:34Z

jordeu
Apr 6, 2023
Collaborator

I'll test this use case. It's a difficult one and I don't expect to get better results, but it should be possible to at least give similar results.

Fusion's performance improvements come from the fact that it can download and upload files in the background while the process is running. It is also designed to improve performance when dealing with big files.

This use case is more or less the opposite: a pure download and upload of small files, with nearly nothing done by the process itself.

The best solution that you can build now (with or without Fusion) is to add an intermediate step that concatenates batches of 1000 files in parallel, followed by a final process that collects all of the batch outputs into a single file.
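The batching idea can be tried locally in plain shell before wiring it into a pipeline. The following is only a sketch under stated assumptions: the `demo_batches` directory, the file names and the count of 3,000 files are hypothetical, and background shell jobs stand in for what would be separate parallel tasks in the actual workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical demo data: 3,000 single-row CSV files.
rm -rf demo_batches
mkdir -p demo_batches/in demo_batches/lists demo_batches/parts
for i in $(seq 1 3000); do
  printf 'sample_%s,1,2,3\n' "$i" > "demo_batches/in/file_$i.csv"
done

# Stage 1a: split the full file list into chunks of 1,000 names each.
find demo_batches/in -name "*.csv" | sort > demo_batches/all_files.txt
split -l 1000 demo_batches/all_files.txt demo_batches/lists/list_

# Stage 1b: concatenate each chunk into its own part file, one background
# job per chunk. The demo file names contain no whitespace, so plain xargs
# is sufficient here.
for list in demo_batches/lists/list_*; do
  xargs cat < "$list" > "demo_batches/parts/$(basename "$list").csv" &
done
wait

# Stage 2: merge the handful of part files into the final result.
cat demo_batches/parts/*.csv > demo_batches/combined_raw.txt
wc -l demo_batches/combined_raw.txt
```

The final step then only has to handle a few medium-sized files instead of thousands of tiny ones, which is the point of the intermediate step.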

0 replies

pditommaso · 2023-04-06T11:47:40Z

pditommaso
Apr 6, 2023
Maintainer

I wonder if this could just be handled via Nextflow `collectFile`?

1 reply

wikiselev Apr 11, 2023
Author

> I wonder if this could just be handled via Nextflow `collectFile`?

Thanks, Paolo! This approach seems to work. In this case the concatenation is performed by Nextflow itself in the NF main process/job. It takes some time (probably more than an hour), and during that time Tower does not do anything (it does not start the next process until `collectFile` is finished). So it may be confusing for someone, as it is not clear what is happening during that time...

It would also be good to add a use case of sending multiple files to `collectFile` to the documentation on `collectFile`. Currently there are only examples with multiple items and a single file.

Thanks again for your quick response, @pditommaso, the issue seems to be resolved for me.

pditommaso · 2023-04-06T12:58:03Z

pditommaso
Apr 6, 2023
Maintainer

Moving this to the discussions, because it's not a Nextflow issue.

0 replies

jordeu · 2023-04-19T09:31:05Z

jordeu
Apr 19, 2023
Collaborator

We found a bug that was making this use case highly inefficient. We've fixed it in the latest Fusion version, which is used automatically when running Nextflow 23.04.

With the new version when we run this pipeline

nextflow run jordeu/nf-tests -r filecollect -profile fusion --files 20000

with an AWS Batch + Fusion v2 + fast storage Tower compute environment, the collector process that concatenates 20k files takes 37 minutes.

2 replies

wikiselev Apr 20, 2023
Author

Amazing, thanks for the update! I will need some time to test this. Will update once done.

pditommaso Apr 20, 2023
Maintainer

Go Vlad! break the system! 😆

Replies: 11 comments · 3 replies