Description
Hi,
I am trying to download viral genomes from genbank. I have put the commands inside a nextflow script and it runs on the login node of our HPC cluster.
The command inside the nextflow script that I use is this:
ncbi-genome-download --formats fasta --section genbank viral --parallel 4 --flat-output -r 5 -P -o genbank_genomes
What I see is that the tool first checks the assemblies. For the refseq genomes that step finishes, but for genbank there are far more genomes to check:
Checking assemblies: 23%|██▎ | 43268/187089 [1:34:09<4:41:54, 8.50entries/s](ncbidown33)
There are two things I notice:
- The connection to NCBI is failing. This is a problem you already know about, and you have addressed it by making the number of retries configurable. I currently have that set to 5.
- The whole job is killed on my login node before the number of retries reaches 5. I notice that the process has been restarted several times because it somehow stalls. I get exit code 143, which usually means the process was killed externally. The job stops at different points: sometimes early, sometimes after 99% of the assemblies have been checked.
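For reference, this is how I read exit code 143. A minimal sketch, assuming the usual shell convention that exit codes above 128 mean "128 + signal number"; the helper name `describe_exit_code` is just for illustration:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a readable description.

    Assumes the common convention that a code above 128 means the
    process was terminated by signal (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"exit code {code}: unknown signal {signum}"
        return f"exit code {code}: terminated by {name} (signal {signum})"
    return f"exit code {code}: process exited on its own"


print(describe_exit_code(143))
```

Under that convention, 143 corresponds to signal 15 (SIGTERM), which fits with something outside the process, such as a watchdog on the login node, asking it to stop.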
So I wonder what to do here?
I have contacted the admin of the HPC I am using to see if they have an idea.
Would it help to use more parallel processes? Are those used in the checking step? That might eat up more CPUs on the login node of our cluster, though. I cannot use the compute nodes, since they have no internet access.
Or is there another way to break up the checking of assemblies, so that I can run smaller batches that will each finish?
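To illustrate what I mean by batching, here is a rough sketch that splits a list of assembly accessions into chunks and builds one command per chunk using the `--assembly-accessions` option. The accessions below are placeholders, the helper names are hypothetical, and I am only assuming (not sure) that restricting a run to a few accessions actually shortens the checking step:

```python
import shlex


def chunk(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_commands(accessions, batch_size, outdir="genbank_genomes"):
    """Build one ncbi-genome-download command line per batch of accessions.

    The commands are only returned as strings, nothing is executed here.
    """
    commands = []
    for batch in chunk(accessions, batch_size):
        args = [
            "ncbi-genome-download",
            "--formats", "fasta",
            "--section", "genbank",
            "--assembly-accessions", ",".join(batch),
            "--parallel", "4",
            "--flat-output",
            "-r", "5",
            "-P",
            "-o", outdir,
            "viral",
        ]
        commands.append(shlex.join(args))
    return commands


# Placeholder accessions, not real records.
example = [f"GCA_00000000{i}.1" for i in range(1, 6)]
for cmd in build_commands(example, batch_size=2):
    print(cmd)
```

Each printed line could then be run as its own nextflow task, so a killed batch would only need that batch to be repeated.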
I know that the cache file that is created contains the FTP location of each genome. By grabbing those I could download all the genomes myself, but then why would I use this tool?
Any suggestions you might have are welcome.