Description
There are 200,000+ *.cif.gz files to unzip, and the current process is strictly serial, so it runs on a single CPU core.
Most systems will have GNU parallel available, which can parallelize and load-balance the extraction of these files. It would be simple to check for its availability and use it, falling back to the reliable serial process if parallel is not found.
Something like this:
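# Use GNU parallel when it is on PATH; otherwise fall back to serial extraction.
if command -v parallel >/dev/null 2>&1
then
  # NUL-delimited names keep paths with spaces or newlines intact.
  find "${RAW}" -type f -name "*.gz" -print0 | parallel -0 -j 8 gunzip {}
else
  find "${RAW}" -type f -name "*.gz" -exec gunzip {} +
fi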
I picked 8 concurrent jobs since AlphaFold seems to use 8 as its default thread count for other aspects of its operation. You might have other ideas about how to handle that.
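One possible way to handle the job count, as a rough sketch rather than a finished patch: take it from an environment variable, then from `nproc`, and only then fall back to 8. The names `JOBS`, `detect_jobs`, `have_gnu_parallel`, and `gunzip_tree` are hypothetical and not part of the existing scripts. The sketch also checks that `parallel` really is GNU parallel, since moreutils ships an unrelated tool under the same name.

```shell
#!/usr/bin/env bash
# Sketch only: all function names and the JOBS variable are hypothetical.

# Pick a job count: honor JOBS if set, else ask nproc, else fall back to 8.
detect_jobs() {
  if [ -n "${JOBS:-}" ]; then
    echo "${JOBS}"
  elif command -v nproc >/dev/null 2>&1; then
    nproc
  else
    echo 8
  fi
}

# Succeed only when `parallel` exists and identifies itself as GNU parallel.
have_gnu_parallel() {
  command -v parallel >/dev/null 2>&1 &&
    parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

# Decompress every *.gz file under the directory given as $1.
gunzip_tree() {
  local dir="$1"
  local jobs
  jobs="$(detect_jobs)"
  if have_gnu_parallel; then
    # NUL-delimited names keep unusual file names intact.
    find "${dir}" -type f -name '*.gz' -print0 |
      parallel -0 -j "${jobs}" gunzip {}
  else
    # Serial fallback, same as the current behavior.
    find "${dir}" -type f -name '*.gz' -exec gunzip {} +
  fi
}
```

With these functions sourced, the download script would call something like `gunzip_tree "${RAW}"`, and a cluster admin could override the count with `JOBS=16`. On network-attached storage the best value may well be lower than the core count, since the workload is bound by I/O on many tiny files rather than by CPU.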
I install AlphaFold in a university cluster-computing environment for researchers, and we keep one centralized set of the databases AlphaFold references on a network-attached storage cluster. The pdb_mmcif databases are essentially a worst-case scenario for this I/O model, since they are organized as a huge number of tiny files. Anything we can do to accelerate the download/unzip/access for these would be helpful.