parallelization opportunity

https://github.com/google-deepmind/alphafold/blob/020cd6d6cb16540114a084f9dbb8f21f811f9d21/scripts/download_pdb_mmcif.sh#L53C48-L54C1

There are 200,000+ `*.cif.gz` files to unzip, and the current process is strictly serial, so it runs on only a single CPU core.

Most systems will have GNU parallel available, which can parallelize and load-balance the extraction of these files. It would be simple to check for its availability and use it, and fall back to the reliable serial process if parallel is not found.

Something like this:

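```
# Use parallel when it is installed; otherwise keep the existing serial path.
if command -v parallel >/dev/null 2>&1
then
   # -print0 / -0 pass file names NUL-delimited, so odd characters are safe.
   find "${RAW}" -type f -name "*.gz" -print0 | parallel -0 -j 8 gunzip {}
else
   find "${RAW}" -type f -name "*.gz" -exec gunzip {} +
fi
```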

I picked 8 concurrent jobs since AlphaFold seems to use that as its default thread count for other parts of its operation. You might have other ideas about how to handle that.
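If a fixed 8 is not wanted, one option is to derive the job count from the machine and to keep a parallel fallback even without GNU parallel. The following is only a rough sketch: `pick_jobs` and `gunzip_tree` are hypothetical helper names that do not exist in `download_pdb_mmcif.sh`, and the batch size of 64 is an arbitrary assumption. It also checks that `parallel` really is GNU parallel, since some systems ship a different tool under the same name.

```shell
# Sketch only: pick_jobs and gunzip_tree are hypothetical helpers,
# not part of the AlphaFold scripts.

# Print a job count: $1 if given, else the number of online CPUs, else 8.
pick_jobs() {
  if [ -n "${1:-}" ]; then
    echo "$1"
  elif command -v nproc >/dev/null 2>&1; then
    nproc
  else
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 8
  fi
}

# Decompress every *.gz file under directory $1 with $2 concurrent jobs
# ($2 is optional; pick_jobs chooses a value when it is omitted).
gunzip_tree() {
  if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    find "$1" -type f -name '*.gz' -print0 |
      parallel -0 -j "$(pick_jobs "${2:-}")" gunzip {}
  else
    # xargs -P is a GNU/BSD extension rather than POSIX. Passing up to 64
    # files to each gunzip call (an assumed batch size) limits process
    # start-up overhead.
    find "$1" -type f -name '*.gz' -print0 |
      xargs -0 -n 64 -P "$(pick_jobs "${2:-}")" gunzip
  fi
}
```

Calling `gunzip_tree "${RAW}"` would then use one job per online CPU, and `gunzip_tree "${RAW}" 8` would keep the fixed count. On network-attached storage the best value may differ from the core count because the work is largely I/O-bound, so it is probably worth measuring.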

I install AlphaFold in a university cluster-computing environment for researchers, and we keep one centralized set of the databases AlphaFold references on a network-attached storage cluster. The pdb_mmcif databases are essentially a worst-case scenario for this I/O model since they are organized as a huge number of tiny files. Anything we can do to accelerate the download/unzip/access for these would be helpful.


parallelization opportunity #1075
