This repository provides instructions and scripts for building an MMseqs2 database of virus proteins annotated with taxonomic data. The database can be used to assign sequences to virus taxa as defined by the International Committee on Taxonomy of Viruses (ICTV). A pre-built database is available for download from Zenodo: doi:10.5281/zenodo.6574913.
To generate the database on your own, ensure the following dependencies are installed on your system. This repository includes a pixi.toml file, which allows you to use Pixi to easily install all required software.
# Clone the repository
git clone git@github.com:apcamargo/ictv-mmseqs2-protein-database.git
cd ictv-mmseqs2-protein-database
# Install the dependencies and activate the environment
pixi shell
The following tools will be installed:
Download the ICTV's VMR MSL39 file and convert it to a tabular format:
# Download the data
curl -LJOs https://ictv.global/sites/default/files/VMR/VMR_MSL39.v4_20241106.xlsx
# Convert the xlsx file to a tsv file that contains taxonomic lineages, one per line
csvtk xlsx2csv VMR_MSL39.v4_20241106.xlsx \
| csvtk csv2tab \
| sed 's/\xc2\xa0/ /g' \
| csvtk replace -t -F -f "*" -p "^\s+|\s+$" \
| csvtk cut -t -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" \
| csvtk uniq -t -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" \
| awk 'NR==1 {$0=tolower($0)} {print}' \
> ictv_taxonomy.tsv
# Generate a file associating species to virus names
csvtk xlsx2csv VMR_MSL39.v4_20241106.xlsx \
| csvtk csv2tab \
| csvtk cut -t -f "Species,Virus name(s)" \
| csvtk uniq -t -f "Species,Virus name(s)" \
| awk -v FS="\t" 'NR>1 && $2!=""' \
> ictv_species_names.tsv
Generate a taxdump of the ICTV taxonomy using taxonkit, then run the fix_taxdump.py script to ensure taxids are sequential and compatible with MMseqs2:
taxonkit create-taxdump ictv_taxonomy.tsv --out-dir ictv_taxdump
python scripts/fix_taxdump.py
For convenience, this repository includes a pre-built ICTV taxdump. To use it, simply extract the ictv_taxdump.tar.gz file instead of generating it from scratch.
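To illustrate what making taxids sequential can look like, here is a minimal sketch. It is not the repository's fix_taxdump.py; the function names (parse_dmp_line, format_dmp_line, renumber_taxdump) are hypothetical, and it assumes the usual taxdump layout in which fields are separated by a tab, a pipe, and a tab, the first column of nodes.dmp and names.dmp is the taxid, and the second column of nodes.dmp is the parent taxid.

```python
def parse_dmp_line(line):
    """Split one taxdump line into its fields (assumes tab-pipe-tab separators)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing field terminator
    return line.split("\t|\t")


def format_dmp_line(fields):
    """Join fields back into the taxdump line layout."""
    return "\t|\t".join(fields) + "\t|"


def renumber_taxdump(nodes_lines, names_lines):
    """Replace arbitrary taxids with sequential ones (1, 2, 3, ...).

    Returns the rewritten nodes lines, the rewritten names lines, and the
    old-to-new taxid mapping. The root (a node that is its own parent) is
    numbered first so that it receives taxid 1.
    """
    nodes = [parse_dmp_line(l) for l in nodes_lines if l.strip()]
    names = [parse_dmp_line(l) for l in names_lines if l.strip()]

    roots = [n[0] for n in nodes if n[0] == n[1]]
    root_set = set(roots)
    others = sorted((n[0] for n in nodes if n[0] not in root_set), key=int)
    mapping = {old: str(new) for new, old in enumerate(roots + others, start=1)}

    new_nodes = [
        format_dmp_line([mapping[n[0]], mapping[n[1]]] + n[2:]) for n in nodes
    ]
    new_names = [
        format_dmp_line([mapping[n[0]]] + n[1:]) for n in names if n[0] in mapping
    ]
    return new_nodes, new_names, mapping
```

The mapping is returned so that any other table keyed by the old taxids could be updated the same way. The script shipped in this repository may handle more files and edge cases than this sketch does.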
Download the NCBI taxdump and filter the prot.accession2taxid file to retain only virus proteins:
# Download the NCBI taxdump
curl -LJOs ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir ncbi_taxdump
tar zxf taxdump.tar.gz -C ncbi_taxdump
rm taxdump.tar.gz
# Download the prot.accession2taxid file and filter only virus proteins
curl -LJs https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz \
| gzip -dc \
| awk 'NR > 1 && $2 != 0' \
| taxonkit lineage -c -i 2 --data-dir ncbi_taxdump \
| awk '$4 ~ /^Viruses/' \
| awk -v OFS="\t" '{print $1, $3}' \
> accession2taxid.tsv
Run the get_ictv_taxids.py script to assign ICTV taxids to the NR virus proteins:
python scripts/get_ictv_taxids.py
Download and filter the NCBI NR database sequences, retaining only proteins that could be assigned to ICTV taxa:
# Download the NR database and retain only virus proteins
curl -LJs https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
| seqkit grep --quiet -j 4 -f <(cut -f 1 accession2taxid_ictv.tsv) \
| seqkit seq -i -w 0 \
| gzip \
> ictv_nr.faa.gz
Remove proteins from accession2taxid_ictv.tsv that are not present in the filtered NR FASTA file:
# Create a tabular file associating the protein accessions with their taxids
awk 'NR==FNR {ids[$1]; next} $1 in ids' \
<(seqkit fx2tab -ni ictv_nr.faa.gz) \
accession2taxid_ictv.tsv \
> accession2taxid_ictv_nr.tsv
Using the filtered NR FASTA, the ICTV taxdump, and the accession2taxid_ictv_nr.tsv file, build an MMseqs2 protein database containing ICTV taxonomy data:
mkdir ictv_nr_db
mmseqs createdb --dbtype 1 ictv_nr.faa.gz ictv_nr_db/ictv_nr_db
mmseqs createtaxdb ictv_nr_db/ictv_nr_db tmp --ncbi-tax-dump ictv_taxdump --tax-mapping-file accession2taxid_ictv_nr.tsv
rm -rf tmp
The ictv_nr_db MMseqs2 database can be used to assign taxonomic information to virus genomes. For example, to assign taxonomy to a set of genomes stored in a genomes.fna FASTA file, use the following command:
mmseqs easy-taxonomy genomes.fna ictv_nr_db/ictv_nr_db result tmp --blacklist "" --tax-lineage 1
This database was used to support taxonomic classification for a submission to the ICTV Computational Virus Taxonomy Challenge, a community effort to evaluate computational methods for virus taxonomy assignment. The database version VMR_MSL39.v4_20241106, released on 2025-01-26, served as the reference for the ictv-taxonomy-challenge-nr submission.
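As a follow-up to the easy-taxonomy command above, the sketch below tallies how many query sequences were assigned to each taxon. It assumes that the per-query LCA table written by MMseqs2 (result_lca.tsv for the result prefix used above) is tab-separated, that its first four columns are query, taxid, rank, and taxon name, and that unassigned queries carry taxid 0. The function name count_assignments is hypothetical; check the column layout against the MMseqs2 documentation for your version before relying on it.

```python
from collections import Counter


def count_assignments(lca_lines):
    """Count queries per (rank, taxon name) from LCA report lines.

    Assumes tab-separated lines whose first four columns are
    query, taxid, rank, and name. Queries with taxid 0 are
    grouped under ("unclassified", "unclassified").
    """
    counts = Counter()
    for line in lca_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _query, taxid, rank, name = fields[:4]
        if taxid == "0":
            counts[("unclassified", "unclassified")] += 1
        else:
            counts[(rank, name)] += 1
    return counts
```

With a real report, the function could be called as `count_assignments(open("result_lca.tsv"))`, and `counts.most_common(10)` would then list the ten most frequent assignments.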
Uri Neri and Lander De Coninck made significant contributions to this repository by developing approaches for systematic retrieval of virus proteins containing taxonomy information.