parallelization opportunity #1075

@matthewabrown

https://github.com/google-deepmind/alphafold/blob/020cd6d6cb16540114a084f9dbb8f21f811f9d21/scripts/download_pdb_mmcif.sh#L53C48-L54C1

There are 200,000+ *.cif.gz files to unzip, and the current process is strictly serial, so it runs on a single CPU core.

Most systems will have GNU parallel available, which can parallelize and load-balance the extraction of these files. It would be simple to check for its availability and use it, falling back to the reliable serial process if parallel is not found.

Something like this:

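# Use GNU parallel when it is on the PATH; otherwise keep the serial path.
if command -v parallel >/dev/null 2>&1
then
   find "${RAW}" -type f -name "*.gz" | parallel -j 8 gunzip {}
else
   # Serial fallback: one gunzip invocation over batches of files.
   find "${RAW}" -type f -name "*.gz" -exec gunzip {} +
fi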
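
A slightly more defensive variant of the same idea, as a sketch only. Two assumptions that are mine rather than anything in the AlphaFold scripts: the function name `gunzip_tree` is hypothetical, and the `--version` check is there because some systems ship a different `parallel` (for example the one from moreutils) whose options are not compatible with GNU parallel. File names are passed NUL-delimited so that unusual paths are handled safely.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of AlphaFold): gunzip every *.gz under a
# directory, in parallel when GNU parallel is available, serially otherwise.
gunzip_tree() {
  local dir="$1"
  local jobs="${2:-8}"   # default of 8 mirrors the value suggested in this issue

  # Only take the parallel path if the installed tool identifies as GNU parallel.
  if command -v parallel >/dev/null 2>&1 &&
     parallel --version 2>/dev/null | grep -q 'GNU parallel'
  then
    # -print0 / -0 keep file names intact even if they contain spaces.
    find "${dir}" -type f -name '*.gz' -print0 |
      parallel -0 -j "${jobs}" gunzip {}
  else
    # Serial fallback, same as the existing behaviour.
    find "${dir}" -type f -name '*.gz' -exec gunzip {} +
  fi
}
```

It could be called as `gunzip_tree "${RAW}" 8`, with `${RAW}` standing in for whatever variable the download script uses for the raw directory.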

I picked 8 concurrent jobs since AlphaFold seems to use 8 as the default number of threads for other aspects of its operation. You might have other ideas about how to handle that.

I install AlphaFold in a university cluster-computing environment for researchers, and we keep one centralized set of the databases AlphaFold references on a network-attached storage cluster. The pdb_mmcif database is essentially a worst-case scenario for this I/O model, since it is organized as a huge number of tiny files. Anything we can do to accelerate the download/unzip/access for these would be helpful.
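
One possible way to handle the job count, as a rough sketch: let the user override it, default to 8, and never ask for more jobs than there are online cores. The function name `pick_jobs` and the variable `GUNZIP_JOBS` are hypothetical names of mine, not something AlphaFold reads, and capping at the core count is a design choice rather than a requirement (on network storage the bottleneck may well be I/O rather than CPU).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of extraction jobs to run.
# GUNZIP_JOBS is an illustrative override, not an existing AlphaFold setting.
pick_jobs() {
  local requested="${GUNZIP_JOBS:-8}"
  local cores

  # Number of online cores; fall back to 1 if it cannot be determined.
  cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

  # Reject anything that is not a plain non-negative integer.
  case "${requested}" in ''|*[!0-9]*) requested=8 ;; esac
  case "${cores}" in ''|*[!0-9]*) cores=1 ;; esac

  # Always run at least one job.
  if [ "${requested}" -lt 1 ]; then requested=1; fi

  # Cap the request at the core count.
  if [ "${requested}" -gt "${cores}" ]; then
    echo "${cores}"
  else
    echo "${requested}"
  fi
}
```

The result could then be passed along as `parallel -j "$(pick_jobs)" gunzip {}`.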
