
3. Creating a FlexTaxD database

Andreas Sjödin edited this page Nov 8, 2023 · 1 revision

Creating a FlexTaxD Database

Taxonomy files are required to create a FlexTaxD database (Fdb). The database is built from taxonomic data obtained from a source such as NCBI or GTDB, or from the predefined test datasets. The instructions below describe how to obtain these files and how to create the Fdb for each use case.

Obtaining Taxonomy Data

From NCBI

To obtain taxonomy files from the NCBI taxonomy database:

  1. Download the taxonomic names and nodes:
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip names.dmp nodes.dmp
  2. For genomic analysis, download the file that maps genome accession IDs to taxonomy IDs:
wget https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
# -k keeps the .gz file, which the NCBI build command further down references
gunzip -k nucl_gb.accession2taxid.gz
  3. Download the genomic data (either all bacterial genomes or selected genera):
# All bacterial genomes
ncbi-genome-download bacteria -p 20 -r 50 -l complete,chromosome -F fasta -o genomes

# Selected bacterial genera
ncbi-genome-download bacteria -p 20 -r 50 -l complete,chromosome -F fasta -o genomes --genera "Streptomyces,Escherichia,Yersinia,Francisella"

Steps 2 and 3 must both be completed for the genomes of interest, because the accession IDs listed in the accession2taxid file are the ones found in the FASTA headers of the genome files.
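
The link between the accession2taxid file and the FASTA headers can be spot-checked before building. The following is a minimal sketch and not part of FlexTaxD: `missing_accessions` is a hypothetical helper that lists header accessions with no entry in the mapping. It assumes the uncompressed mapping is tab-separated with accession.version in column 2, and that the first token of each FASTA header is the accession.

```shell
# Hypothetical helper (not a FlexTaxD command).
# Reads FASTA text on stdin and prints every header accession that has
# no row in the accession2taxid file given as the first argument.
# Assumption: column 2 of the mapping holds accession.version.
missing_accessions() {
  map_file=$1
  awk -F '\t' '
    NR == FNR {                      # first input: FASTA from stdin
      if ($0 ~ /^>/) {
        split(substr($0, 2), parts, /[ \t]/)
        want[parts[1]] = 1           # remember each header accession
      }
      next
    }
    ($2 in want) { delete want[$2] } # second input: mapping rows
    END { for (acc in want) print acc }
  ' - "$map_file"
}
```

Only the header accessions are held in memory, so the large mapping file is streamed once. Downloaded genome files may be gzip-compressed, in which case something like `gzip -dc some_genome.fna.gz | missing_accessions nucl_gb.accession2taxid` would apply (the genome file name is a placeholder). Empty output means every header accession was found.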

From GTDB

To obtain bacterial taxonomy files from the GTDB database:

wget https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz
gunzip bac120_taxonomy.tsv.gz
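
The GTDB taxonomy file is a two-column TSV: a genome accession followed by a semicolon-separated lineage with rank prefixes such as `f__` for family. As a minimal sketch, a single family can be extracted into a smaller taxonomy file with the hypothetical helper below. `gtdb_family_subset` is an illustrative name and not a FlexTaxD command.

```shell
# Hypothetical helper (not a FlexTaxD command).
# Prints the rows of a GTDB taxonomy TSV whose lineage contains the given family.
# Assumption: column 2 is the lineage, with the family written as "f__<name>;".
gtdb_family_subset() {
  family=$1
  taxonomy_file=$2
  awk -F '\t' -v pat="f__${family};" 'index($2, pat) > 0' "$taxonomy_file"
}
```

For example, `gtdb_family_subset Francisellaceae bac120_taxonomy.tsv > subset.tsv` would write the matching rows to `subset.tsv` (a placeholder output name).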

From Test Files

Test files can be found in the example_data folder. The relevant files are:

  • For FlexTaxD format: ftd.tree2tax.tsv and genomes_map.tsv
  • For CanSNPer format: cansnp.tree2tax.txt and genomes_map.tsv
  • For GTDB format: gtdb.tsv (subset for Francisellaceae)

Building the FlexTaxD Database

Using NCBI Taxonomy

With genomic data (enables mapping between taxonomy and genome files):

flextaxd --db ncbi.fdb --taxonomy_file nodes.dmp --taxonomy_type NCBI --genomeid2taxid nucl_gb.accession2taxid.gz --genomes_path genomes

Without genomic data (taxonomy-only, without genome file mapping):

flextaxd --db ncbi.fdb --taxonomy_file nodes.dmp --taxonomy_type NCBI
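
To get a feel for what the taxonomy file contains, nodes.dmp can be inspected directly. In the NCBI dump format, fields are separated by a tab, a pipe, and a tab, and the first three fields are the taxonomy ID, the parent taxonomy ID, and the rank. The sketch below uses a hypothetical helper, `ncbi_node_info`, which is not part of FlexTaxD.

```shell
# Hypothetical helper (not a FlexTaxD command).
# Prints "<parent taxid><TAB><rank>" for one taxonomy ID in an NCBI nodes.dmp.
# The field separator is the dump format's tab-pipe-tab sequence.
ncbi_node_info() {
  taxid=$1
  nodes_file=$2
  awk -F '\t[|]\t' -v id="$taxid" '$1 == id { print $2 "\t" $3 }' "$nodes_file"
}
```

Calling `ncbi_node_info 2 nodes.dmp` would print the parent ID and rank recorded for taxonomy ID 2.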

Using GTDB Taxonomy

flextaxd --db gtdb.fdb --taxonomy_file bac120_taxonomy.tsv --taxonomy_type GTDB

Using the Test Dataset Taxonomy (Francisellaceae)

For FlexTaxD format:

flextaxd --db francisellaceae.fdb --taxonomy_file ftd.tree2tax.tsv --genomeid2taxid genomes_map.tsv

For CanSNPer format:

flextaxd --db francisellaceae.fdb --taxonomy_file cansnp.tree2tax.txt --genomeid2taxid genomes_map.tsv --taxonomy_type CanSNPer

Ensure that FlexTaxD is installed and available in your PATH before running these commands. Depending on the size of the genomic dataset and the performance of your computer, database creation can take a long time. For large datasets, it is advisable to run the build in a screen session or with nohup so that it survives a closed terminal.
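
As a minimal sketch of the nohup approach, the hypothetical wrapper below starts any command in the background, detached from the terminal, and sends its output to a log file. `run_detached` and the log file name are illustrative and not part of FlexTaxD.

```shell
# Hypothetical wrapper (not a FlexTaxD command).
# Usage: run_detached LOG_FILE COMMAND [ARGS...]
# Runs COMMAND under nohup in the background, writes stdout and stderr
# to LOG_FILE, and prints the PID of the background process.
run_detached() {
  log_file=$1
  shift
  nohup "$@" > "$log_file" 2>&1 &
  echo "$!"
}

# Example call (commented out, since it needs FlexTaxD and the input files):
# run_detached build.log flextaxd --db gtdb.fdb --taxonomy_file bac120_taxonomy.tsv --taxonomy_type GTDB
```

Progress can then be followed with `tail -f build.log`, and the printed PID can be used to check whether the build is still running.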
