3. Creating a FlexTaxD database
Taxonomy files are required to create a FlexTaxD database (Fdb). The database is built from taxonomic data compiled from sources such as NCBI or GTDB, or from the predefined test datasets. The instructions below describe how to obtain these files and create the Fdb for each use case.
To obtain taxonomy files from the NCBI taxonomy database:
- Download the taxonomic names and nodes:
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip names.dmp nodes.dmp
- For genomic analysis, download the mapping file of taxonomic IDs to genome accession IDs:
wget https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
- Download the genomic data (either all bacterial genomes or selected genera):
# All bacterial genomes
ncbi-genome-download bacteria -p 20 -r 50 -l complete,chromosome -F fasta -o genomes
# Selected bacterial genera
ncbi-genome-download bacteria -p 20 -r 50 -l complete,chromosome -F fasta -o genomes --genera "Streptomyces,Escherichia,Yersinia,Francisella"
Both the accession2taxid download and the genome download must be completed for the genomes of interest, because the accession IDs listed in the accession2taxid file are the ones found in the FASTA headers of the genome files.
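Because the genome FASTA headers and the accession2taxid file have to agree, it can help to compare them before building the database. The following is a minimal sketch and not part of FlexTaxD. The function name `missing_accessions` is hypothetical, and the sketch assumes that the first word of each FASTA header is an accession.version identifier and that the mapping file uses NCBI's four tab-separated columns (accession, accession.version, taxid, gi) with a header line.

```shell
# Hypothetical helper: print FASTA header accessions that are absent
# from an accession2taxid file.
# Usage: missing_accessions <accession2taxid> <fasta>...
missing_accessions() {
    map_file=$1
    shift
    wanted=$(mktemp) || return 1

    # Collect the first word of every header line (without the leading ">"),
    # reading both plain and gzip-compressed FASTA files.
    for fasta in "$@"; do
        case $fasta in
            *.gz) gzip -dc "$fasta" ;;
            *)    cat "$fasta" ;;
        esac
    done | awk '/^>/ { sub(/^>/, "", $1); print $1 }' > "$wanted"

    # Load the wanted accessions, then stream the (potentially very large)
    # mapping file and drop every accession that has an entry in column 2.
    awk -F'\t' -v wanted="$wanted" '
        BEGIN { while ((getline line < wanted) > 0) need[line] = 1 }
        FNR > 1 && ($2 in need) { delete need[$2] }
        END { for (acc in need) print acc }
    ' "$map_file" | sort

    rm -f "$wanted"
}
```

Calling `missing_accessions nucl_gb.accession2taxid` followed by the paths of the downloaded FASTA files prints one accession per line for every header without a mapping; empty output means all headers were found. Only the header accessions are held in memory, so the mapping file is read as a stream.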
To obtain bacterial taxonomy files from the GTDB database:
wget https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz
gunzip bac120_taxonomy.tsv.gz
Test files can be found in the example_data folder. The relevant files are:
- For FlexTaxD format: ftd.tree2tax.tsv and genomes_map.tsv
- For CanSNPer format: cansnp.tree2tax.txt and genomes_map.tsv
- For GTDB format: gtdb.tsv (subset for Francisellaceae)
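The GTDB test file is described as a Francisellaceae subset. How that file was produced is not stated here; as a hedged sketch, a comparable subset could be cut from the full taxonomy file. This assumes the usual GTDB layout of two tab-separated columns: a genome accession followed by a semicolon-separated lineage with rank prefixes such as `f__` for family. The function name `gtdb_subset_family` is hypothetical.

```shell
# Hypothetical helper: keep only the rows of a GTDB taxonomy TSV whose
# lineage (column 2) contains the given family.
# Usage: gtdb_subset_family <family> <taxonomy.tsv>
gtdb_subset_family() {
    family=$1
    taxonomy_tsv=$2
    awk -F'\t' -v rank="f__$family" '
        {
            # Split the lineage into its ranks and look for an exact match
            n = split($2, ranks, ";")
            for (i = 1; i <= n; i++) {
                if (ranks[i] == rank) { print; break }
            }
        }
    ' "$taxonomy_tsv"
}
```

For instance, `gtdb_subset_family Francisellaceae bac120_taxonomy.tsv > francisellaceae_gtdb.tsv` would write the matching rows to a new file (the output file name is only an example). Matching the whole rank string avoids picking up families whose names merely start with the same letters.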
To create the Fdb from NCBI taxonomy with genomic data (enables mapping between taxonomy and genome files):
flextaxd --db ncbi.fdb --taxonomy_file nodes.dmp --taxonomy_type NCBI --genomeid2taxid nucl_gb.accession2taxid.gz --genomes_path genomes
Note that gunzip, as used in the download step, replaces nucl_gb.accession2taxid.gz with the decompressed nucl_gb.accession2taxid, so make sure the file name passed to --genomeid2taxid matches a file that exists.
Without genomic data (taxonomy only, without genome file mapping):
flextaxd --db ncbi.fdb --taxonomy_file nodes.dmp --taxonomy_type NCBI
To create the Fdb from GTDB taxonomy:
flextaxd --db gtdb.fdb --taxonomy_file bac120_taxonomy.tsv --taxonomy_type GTDB
To create the Fdb from the test files, for FlexTaxD format:
flextaxd --db francisellaceae.fdb --taxonomy_file ftd.tree2tax.tsv --genomeid2taxid genomes_map.tsv
For CanSNPer format:
flextaxd --db francisellaceae.fdb --taxonomy_file cansnp.tree2tax.txt --genomeid2taxid genomes_map.tsv --taxonomy_type CanSNPer
Ensure that FlexTaxD is installed and available in your PATH before running these commands. Depending on the size of the genomic datasets and the performance of your computer, database creation can take a significant amount of time. For large datasets, it is advisable to run the commands in a screen session or with nohup.
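The PATH check and the nohup advice can be combined in a small wrapper. This is a sketch under stated assumptions rather than anything shipped with FlexTaxD: the helper names `require_cmd` and `run_detached` and the log file name are hypothetical, and only standard shell features (`command -v`, `nohup`) are used.

```shell
# Hypothetical helper: fail early if a required program is not on PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "error: '$1' not found in PATH" >&2
        return 1
    }
}

# Hypothetical helper: run a command under nohup in the background,
# send stdout and stderr to the given log file, and print the PID.
# Usage: run_detached <log_file> <command> [args...]
run_detached() {
    log_file=$1
    shift
    nohup "$@" >"$log_file" 2>&1 &
    echo $!
}

# Example (not executed here):
#   require_cmd flextaxd && run_detached ncbi_build.log \
#       flextaxd --db ncbi.fdb --taxonomy_file nodes.dmp --taxonomy_type NCBI
```

The printed PID can be used to check on the build later, for example with `kill -0 <pid>`, and the log file keeps any messages that would otherwise be lost when the terminal closes.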