Data pipeline to provide individual, combined and consensus-filtered domain annotations for protein structures using Chainsaw, Merizo and UniDoc.
Clone the repo: https://github.com/UCLOrengoGroup/domain-annotation-pipeline
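For example (the directory name follows the repository name):

```shell
git clone https://github.com/UCLOrengoGroup/domain-annotation-pipeline.git
cd domain-annotation-pipeline
```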
Install Nextflow
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install nextflow
Install Docker: https://docs.docker.com/compose/install/
Build Docker containers and run
docker compose build

The following runs the pipeline in debug mode, which uses test data included in this repository.
Note: either docker or singularity must be supplied as one of the profile arguments.
nextflow run workflows/annotate.nf -profile debug,docker
The pipeline expects two inputs:
- a zip file containing PDB files
- a file containing all the ids that should be processed
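If you prefer to prepare the ID list programmatically, here is a minimal Python sketch using only the standard library (the function names are hypothetical; it is equivalent to the shell commands below):

```python
# Hypothetical helpers: derive the ID list from a zip of .pdb files.
import zipfile
from pathlib import Path

def ids_from_pdb_zip(zip_path: str) -> list[str]:
    """Return the file names in the zip with the .pdb suffix removed."""
    with zipfile.ZipFile(zip_path) as zf:
        return [Path(name).stem for name in zf.namelist() if name.endswith(".pdb")]

def write_ids_file(zip_path: str, out_path: str) -> None:
    """Write one ID per line, as the pipeline expects."""
    Path(out_path).write_text("\n".join(ids_from_pdb_zip(zip_path)) + "\n")
```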
Given the following directory:
pdb_files/A0A3G5A0R2.pdb
pdb_files/A0A8S5U119.pdb
pdb_files/A0A0B5IZ33.pdb
pdb_files/UPI001E716444.pdb
pdb_files/A0A6C0N656.pdb

Create a zip file from all PDB files in this directory:
cd pdb_files
zip -r ../pdb_files.zip .

Create a file containing all the ids to process:
# list the files in the zip and remove the `.pdb` suffix
zipinfo -1 pdb_files.zip | sed 's/\.pdb$//' > ids.txt

Pass these parameters to nextflow:
nextflow run workflows/annotate.nf \
--pdb_zip_file pdb_files.zip \
--uniprot_csv_file ids.txt \
-profile debug,docker

Also useful to note:
The output directory can be controlled with the --project_name parameter.
The three chunk size parameters control how many IDs are processed concurrently at different stages of the workflow:
--chunk_size
--light_chunk_size
--heavy_chunk_size

The parameter --heavy_chunk_size is used for the run_ted_segmentation process and should be set with maximum memory limits in mind.
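For example, a run that overrides all three (the values here are illustrative, not recommendations):

```shell
nextflow run workflows/annotate.nf \
  --pdb_zip_file pdb_files.zip \
  --uniprot_csv_file ids.txt \
  --chunk_size 100 \
  --light_chunk_size 50 \
  --heavy_chunk_size 10 \
  -profile docker
```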
The pipeline now runs Foldseek on output domains automatically.
It will download the required CATH databases (currently v4.4.0 s95) to the folder ../foldseek/assets/<url hash>.
If the database URL is changed or the Foldseek assets are missing or deleted, please run the pipeline without -resume to ensure correct download behaviour.
These instructions are specific to the HPC setup in UCL Computer Science:
- Clone the GitHub repository
- Request access to the Nextflow submit node: `askey`
- Log in to `askey`
Set the following Nextflow environment variables interactively, or add them to ~/.bashrc.
export NXF_OPTS='-Xms3g -Xmx3g'
export PATH=/share/apps/jdk-20.0.2/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/jdk-20.0.2/lib:$LD_LIBRARY_PATH
export JAVA_HOME=/share/apps/jdk-20.0.2
export PATH=/share/apps/genomics/nextflow-local-23.04.2:$PATH

Create a cache directory for Nextflow (not strictly necessary, but it prevents warnings).
mkdir -p ~/Scratch/nextflow_singularity_cache
export NXF_SINGULARITY_CACHEDIR=$HOME/Scratch/nextflow_singularity_cache

Set the following Python environment variables interactively, or add them to ~/.bashrc.
export PATH=/share/apps/python-3.13.0a6-shared/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/python-3.13.0a6-shared/lib:$LD_LIBRARY_PATH
source /share/apps/source_files/python/python-3.13.0a6.source

Set up the venv environment:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

The latest containers are built and stored in the GitHub Container Registry (ghcr.io) as part of the automated build.
These can be downloaded as singularity images with singularity pull:
Note: the following requires setting up a GitHub personal access token
singularity pull --docker-login domain-annotation-pipeline-script_latest.sif docker://ghcr.io/uclorengogroup/domain-annotation-pipeline-script:main-latest
singularity pull --docker-login domain-annotation-pipeline-cath-af-cli_latest.sif docker://ghcr.io/uclorengogroup/domain-annotation-pipeline-cath-af-cli:main-latest
singularity pull --docker-login domain-annotation-pipeline-ted-tools_latest.sif docker://ghcr.io/uclorengogroup/domain-annotation-pipeline-ted-tools:main-latest

The directory containing these Singularity images can be added to your config file, or passed directly to nextflow:
nextflow run workflows/annotate.nf -profile singularity \
--singularity_image_dir "/path/to/singularity_images"
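As a sketch of the config-file alternative (assuming the standard Nextflow params mechanism; adjust the path to your setup):

```groovy
// nextflow.config — hypothetical fragment
params.singularity_image_dir = '/path/to/singularity_images'
```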