ictv-mmseqs2-protein-database

This repository provides instructions and scripts for building an MMseqs2 database of virus proteins annotated with taxonomic data. The database can be used to assign sequences to virus taxa as defined by the International Committee on Taxonomy of Viruses (ICTV). A pre-built database is available for download from Zenodo: doi:10.5281/zenodo.6574913.

Dependencies

To generate the database on your own, ensure the following dependencies are installed on your system. This repository includes a pixi.toml file, which allows you to use Pixi to easily install all required software.

# Clone the repository
git clone git@github.com:apcamargo/ictv-mmseqs2-protein-database.git
cd ictv-mmseqs2-protein-database
# Install the dependencies and activate the environment
pixi shell

The environment provides the command-line tools used in the steps below, such as csvtk, taxonkit, seqkit, and MMseqs2.

Building the database

Step 1: Download and process ICTV taxonomy data

Download the ICTV's VMR MSL39 file and convert it to a tabular format:

# Download the data
curl -LJOs https://ictv.global/sites/default/files/VMR/VMR_MSL39.v4_20241106.xlsx

# Convert the xlsx file to a tsv file that contains taxonomic lineages, one per line
csvtk xlsx2csv VMR_MSL39.v4_20241106.xlsx \
    | csvtk csv2tab \
    | sed 's/\xc2\xa0/ /g' \
    | csvtk replace -t -F -f "*" -p "^\s+|\s+$" \
    | csvtk cut -t -f  "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" \
    | csvtk uniq -t -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" \
    | awk 'NR==1 {$0=tolower($0)} {print}' \
    > ictv_taxonomy.tsv

# Generate a file associating species to virus names
csvtk xlsx2csv VMR_MSL39.v4_20241106.xlsx \
    | csvtk csv2tab \
    | csvtk cut -t -f "Species,Virus name(s)" \
    | csvtk uniq -t -f "Species,Virus name(s)" \
    | awk -v FS="\t" 'NR>1 && $2!=""' \
    > ictv_species_names.tsv
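
Before moving on, it can help to confirm that the lineage table is well formed. The sketch below is not part of this repository: check_lineage_table is a hypothetical helper that relies only on awk and assumes the table is tab-separated with a single header row, as produced by the commands above.

```shell
# Hypothetical helper (not part of the repository): verify that every row of a
# tab-separated lineage table has the same number of fields as its header.
check_lineage_table() {
    awk -F '\t' '
        NR == 1 { expected = NF; next }   # remember the header width
        NF != expected { bad++ }          # count rows whose width differs
        END {
            printf "columns=%d lineages=%d malformed=%d\n", expected, NR - 1, bad
            exit (bad > 0)                # non-zero exit if any row is malformed
        }
    ' "$1"
}
```

Running check_lineage_table ictv_taxonomy.tsv should report 15 columns if the csvtk cut step selected every rank listed above, and the function returns a non-zero status when any row has a different width.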

Step 2: Create a taxdump for the ICTV taxonomy

Generate a taxdump of the ICTV taxonomy using taxonkit, then run the fix_taxdump.py script to ensure taxids are sequential and compatible with MMseqs2:

taxonkit create-taxdump ictv_taxonomy.tsv --out-dir ictv_taxdump
python scripts/fix_taxdump.py

For convenience, this repository includes a pre-built ICTV taxdump. To use it, simply extract the ictv_taxdump.tar.gz file instead of generating it from scratch.
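
Whether the taxdump is generated or extracted, a quick check that the directory is usable can save a failed build later. The following is a minimal sketch, not part of this repository: check_taxdump is a hypothetical helper, and it assumes the directory follows the NCBI taxdump layout with names.dmp and nodes.dmp files.

```shell
# Hypothetical helper (not part of the repository): confirm that a taxdump
# directory contains non-empty names.dmp and nodes.dmp files.
check_taxdump() {
    local dir="$1" f
    for f in names.dmp nodes.dmp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            return 1
        fi
    done
    # Each line of nodes.dmp describes one taxon
    echo "taxa=$(wc -l < "$dir/nodes.dmp" | tr -d ' ')"
}
```

For example, check_taxdump ictv_taxdump prints the number of taxa in the taxdump, or an error message if one of the two files is missing.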

Step 3: Download and process NCBI NR data

Download the NCBI taxdump and filter the prot.accession2taxid file to retain only virus proteins:

# Download the NCBI taxdump
curl -LJOs ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir ncbi_taxdump
tar zxf taxdump.tar.gz -C ncbi_taxdump
rm taxdump.tar.gz
# Download the prot.accession2taxid file and filter only virus proteins
curl -LJs https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz \
    | gzip -dc \
    | awk 'NR > 1 && $2 != 0' \
    | taxonkit lineage -c -i 2 --data-dir ncbi_taxdump \
    | awk '$4 ~ /^Viruses/' \
    | awk -v OFS="\t" '{print $1, $3}' \
    > accession2taxid.tsv
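
The filtered mapping can be summarized to confirm that the pipeline produced the expected two-column layout (accession, taxid). This is only a sketch: summarize_accession2taxid is a hypothetical helper that is not part of this repository, and it assumes the file is tab-separated, as written by the final awk command above.

```shell
# Hypothetical helper (not part of the repository): count proteins, distinct
# taxids, and malformed rows in a two-column accession-to-taxid table.
summarize_accession2taxid() {
    awk -F '\t' '
        NF != 2 { bad++; next }           # skip rows without exactly two fields
        { taxids[$2]++ }                  # tally proteins per taxid
        END {
            n = 0
            for (t in taxids) n++
            printf "proteins=%d taxids=%d malformed=%d\n", NR - bad, n, bad
        }
    ' "$1"
}
```

Calling summarize_accession2taxid accession2taxid.tsv gives a one-line overview; a non-zero malformed count suggests that a field separator was lost somewhere in the pipeline.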

Step 4: Assign ICTV taxids to virus proteins

Run the get_ictv_taxids.py script to assign ICTV taxids to the NR virus proteins. The commands that follow read the resulting mapping from accession2taxid_ictv.tsv:

python scripts/get_ictv_taxids.py

Download and filter the NCBI NR database sequences, retaining only proteins that could be assigned to ICTV taxa:

# Stream the NR database and retain only virus proteins with an ICTV taxid
curl -LJs https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
    | seqkit grep --quiet -j 4 -f <(cut -f 1 accession2taxid_ictv.tsv) \
    | seqkit seq -i -w 0 \
    | gzip \
    > ictv_nr.faa.gz

Remove proteins from accession2taxid_ictv.tsv that are not present in the filtered NR FASTA file:

# Create a tabular file associating the protein accessions with their taxids
awk 'NR==FNR {ids[$1]; next} $1 in ids' \
    <(seqkit fx2tab -ni ictv_nr.faa.gz) \
    accession2taxid_ictv.tsv \
    > accession2taxid_ictv_nr.tsv
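
After this filtering step, the FASTA file and the mapping should describe the same set of proteins. The sketch below compares the two counts using only gzip, grep, cut, and sort. check_mapping_matches_fasta is a hypothetical helper, not part of this repository, and it assumes each accession appears only once in the FASTA file.

```shell
# Hypothetical helper (not part of the repository): compare the number of
# sequences in a gzipped FASTA with the number of accessions in a mapping file.
check_mapping_matches_fasta() {
    local fasta="$1" mapping="$2" n_seqs n_rows
    # Every FASTA record starts with a ">" header line
    n_seqs=$(gzip -dc "$fasta" | grep -c '^>')
    # Count distinct accessions in the first column of the mapping
    n_rows=$(cut -f 1 "$mapping" | sort -u | wc -l | tr -d ' ')
    echo "sequences=$n_seqs mapped_accessions=$n_rows"
    [ "$n_seqs" -eq "$n_rows" ]
}
```

For example, check_mapping_matches_fasta ictv_nr.faa.gz accession2taxid_ictv_nr.tsv prints both counts and returns a non-zero status if they differ.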

Step 5: Build the MMseqs2 database

Using the filtered NR FASTA, the ICTV taxdump, and the accession2taxid_ictv_nr.tsv file, build an MMseqs2 protein database containing ICTV taxonomy data:

mkdir ictv_nr_db
mmseqs createdb --dbtype 1 ictv_nr.faa.gz ictv_nr_db/ictv_nr_db
mmseqs createtaxdb ictv_nr_db/ictv_nr_db tmp --ncbi-tax-dump ictv_taxdump --tax-mapping-file accession2taxid_ictv_nr.tsv
rm -rf tmp

Using the database for taxonomic assignment

The ictv_nr_db MMseqs2 database can be used to assign taxonomic information to virus genomes. For example, to assign taxonomy to a set of genomes stored in a genomes.fna FASTA file, use the following command:

mmseqs easy-taxonomy genomes.fna ictv_nr_db/ictv_nr_db result tmp --blacklist "" --tax-lineage 1
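
To get a quick overview of the assignments, the per-query results can be tallied by taxonomic rank. The sketch below rests on two assumptions about MMseqs2 output that are not stated in this README: that the command above writes a tab-separated result_lca.tsv file, and that the rank of each assignment is in its third column. summarize_lca_ranks is a hypothetical helper that is not part of this repository.

```shell
# Hypothetical helper (not part of the repository): count how many queries were
# assigned at each taxonomic rank.
# Assumption: the rank is the third tab-separated column of the input file.
summarize_lca_ranks() {
    local tab
    tab=$(printf '\t')
    awk -F '\t' '
        { n[$3]++ }                               # tally queries per rank
        END { for (r in n) print r "\t" n[r] }    # emit rank<TAB>count
    ' "$1" | LC_ALL=C sort -t "$tab" -k2,2nr -k1,1
}
```

For example, summarize_lca_ranks result_lca.tsv lists the ranks from most to least frequent, which shows at a glance how many genomes were resolved to species, genus, or only higher ranks.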

Usage in the ICTV Computational Virus Taxonomy Challenge

This database was used to support taxonomic classification for a submission to the ICTV Computational Virus Taxonomy Challenge, a community effort to evaluate computational methods for virus taxonomy assignment. The database version VMR_MSL39.v4_20241106, released on 2025-01-26, served as the reference for the ictv-taxonomy-challenge-nr submission.

Acknowledgements

Uri Neri and Lander De Coninck made significant contributions to this repository by developing approaches for systematic retrieval of virus proteins containing taxonomy information.