Data pipeline to provide individual, combined and consensus-filtered domain annotations for protein structures using Chainsaw, Merizo and UniDoc.
Clone the repo: https://github.com/UCLOrengoGroup/domain-annotation-pipeline
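For example (the directory name follows the repository name):

```shell
git clone https://github.com/UCLOrengoGroup/domain-annotation-pipeline.git
cd domain-annotation-pipeline
```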
Install Nextflow
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install nextflow
Install Docker: https://docs.docker.com/compose/install/
Build Docker containers and run
docker compose build

The following runs the pipeline in debug mode, which uses test data included in this repository.
Note: either docker or singularity must be supplied as one of the profile arguments.
nextflow run workflows/annotate.nf -profile debug,docker
The pipeline expects two inputs:
- a zip file containing PDB files
- a file containing all the ids that should be processed
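If you prefer to prepare the ID list programmatically, here is a minimal Python sketch using only the standard library (the function names are hypothetical; it is equivalent to the shell commands below):

```python
# Hypothetical helpers: derive the ID list from a zip of .pdb files.
import zipfile
from pathlib import Path

def ids_from_pdb_zip(zip_path: str) -> list[str]:
    """Return the file names in the zip with the .pdb suffix removed."""
    with zipfile.ZipFile(zip_path) as zf:
        return [Path(name).stem for name in zf.namelist() if name.endswith(".pdb")]

def write_ids_file(zip_path: str, out_path: str) -> None:
    """Write one ID per line, as the pipeline expects."""
    Path(out_path).write_text("\n".join(ids_from_pdb_zip(zip_path)) + "\n")
```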
Given the following directory:
pdb_files/A0A3G5A0R2.pdb
pdb_files/A0A8S5U119.pdb
pdb_files/A0A0B5IZ33.pdb
pdb_files/UPI001E716444.pdb
pdb_files/A0A6C0N656.pdb

Create a zip file from all PDB files in this directory:
cd pdb_files
zip -r ../pdb_files.zip .

Create a file containing all the ids to process:
# list the files in the zip and remove the `.pdb` suffix
zipinfo -1 pdb_files.zip | sed 's/\.pdb$//' > ids.txt

Pass these parameters to nextflow:
nextflow run workflows/annotate.nf \
--pdb_zip_file pdb_files.zip \
--uniprot_csv_file ids.txt \
-profile debug,docker

Also useful to note:
The output directory can be controlled with the --project_name parameter.
The three chunk size parameters control how many IDs are processed concurrently at different stages of the workflow:
--chunk_size
--light_chunk_size
--heavy_chunk_size

The parameter --heavy_chunk_size is used for the run_ted_segmentation process and should be set with maximum memory limits in mind.
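For example, a run that overrides all three (the values here are illustrative, not recommendations):

```shell
nextflow run workflows/annotate.nf \
  --pdb_zip_file pdb_files.zip \
  --uniprot_csv_file ids.txt \
  --chunk_size 100 \
  --light_chunk_size 50 \
  --heavy_chunk_size 10 \
  -profile docker
```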
The pipeline now runs Foldseek on output domains automatically.
It will download the required CATH databases (currently v4.4.0 s95) to the folder ../foldseek/assets/<url hash>.
If the database URL is changed or the Foldseek assets are missing or deleted, please run the pipeline without -resume to ensure correct download behaviour.
These instructions are specific to the HPC setup in UCL Computer Science:
- Clone the GitHub repository
- Request access to the Nextflow submit node: `askey`
- Log in to `askey`
Set the following Nextflow environment variables interactively, or add them to ~/.bashrc.
export NXF_OPTS='-Xms3g -Xmx3g'
export PATH=/share/apps/jdk-20.0.2/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/jdk-20.0.2/lib:$LD_LIBRARY_PATH
export JAVA_HOME=/share/apps/jdk-20.0.2
export PATH=/share/apps/genomics/nextflow-local-23.04.2:$PATH

Create a cache directory for Nextflow (not strictly necessary, but it prevents warnings).
mkdir -p ~/Scratch/nextflow_singularity_cache
export NXF_SINGULARITY_CACHEDIR=$HOME/Scratch/nextflow_singularity_cache

Set the following Python environment variables interactively, or add them to ~/.bashrc.
export PATH=/share/apps/python-3.13.0a6-shared/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/python-3.13.0a6-shared/lib:$LD_LIBRARY_PATH
source /share/apps/source_files/python/python-3.13.0a6.source

Set up the venv environment:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

The latest containers are built and stored in the GitHub Container Registry (ghcr.io) as part of the automated build.
These can be downloaded as singularity images with singularity pull:
Note: the following requires setting up a GitHub personal access token
singularity pull --docker-login domain-annotation-pipeline-script_latest.sif docker://ghcr.io/uclorengogroup/domain-annotation-pipeline-script:main-latest
singularity pull --docker-login domain-annotation-pipeline-cath-af-cli_latest.sif docker://ghcr.io/uclorengogroup/domain-annotation-pipeline-cath-af-cli:main-latest
singularity pull --docker-login domain-annotation-pipeline-ted-tools_latest.sif docker://ghcr.io/uclorengogroup/domain-annotation-pipeline-ted-tools:main-latest

The directory containing these Singularity images can be added to your config file, or passed directly to nextflow:
nextflow run workflows/annotate.nf -profile singularity \
--singularity_image_dir "/path/to/singularity_images"
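As a sketch of the config-file alternative (assuming the standard Nextflow params mechanism; adjust the path to your setup):

```groovy
// nextflow.config — hypothetical fragment
params.singularity_image_dir = '/path/to/singularity_images'
```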