8c. Building an NCBI database and optional modification
This example covers creating a FlexTaxD database (Fdb) from NCBI taxonomy files and then compiling the Fdb into a Kraken2 database.
Create an environment using mamba (or conda) and install FlexTaxD together with the dependencies needed for visualisation and for compiling the Kraken2 database:
mamba create -n flextaxd_example flextaxd ncbi-datasets-cli inquirer biopython matplotlib kraken2
conda activate flextaxd_example
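Before continuing, it may be worth confirming that the command-line tools used in this example are available in the activated environment. The helper below is a minimal sketch; `check_tools` is a hypothetical name and not part of FlexTaxD.

```shell
# check_tools: report any of the named commands that cannot be found on PATH.
# Returns 0 if all are present, 1 otherwise.
# (Hypothetical helper for illustration, not part of FlexTaxD.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (commands called later in this example):
#   check_tools flextaxd flextaxd-create datasets kraken2 wget unzip
```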
Obtain the taxonomy files:
# Download
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
# Unpack names.dmp and nodes.dmp from taxdmp.zip
unzip taxdmp.zip names.dmp nodes.dmp
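As a quick sanity check of the unpacked files, a taxon name can be looked up in names.dmp. The sketch below assumes the standard NCBI dump layout, in which fields are separated by a tab, a pipe character and a tab; `lookup_taxid` is a hypothetical helper name.

```shell
# lookup_taxid: print the taxonomy ID(s) whose name in names.dmp matches exactly.
#   $1 = path to names.dmp, $2 = name to look up
# (Hypothetical helper; assumes the "<tab>|<tab>" field separator of NCBI dumps.)
lookup_taxid() {
    awk -F '\t[|]\t' -v name="$2" '$2 == name { print $1 }' "$1" | sort -u
}

# Usage:
#   lookup_taxid names.dmp "Francisella tularensis"
```

The same name is later passed to --parent when the tularensis node is expanded, so confirming that it exists in the taxonomy can catch spelling mistakes early.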
Obtain genome sequence files:
# Download genomes at "chromosome" or "complete" levels from RefSeq. Warning: This operation downloads a ~45 GB file that needs to be decompressed.
datasets download genome taxon --assembly-source RefSeq --assembly-level chromosome,complete Bacteria
# Unpack the downloaded ZIP-file
unzip ncbi_dataset.zip
# Move the downloaded genomes to a new folder
mkdir genomes
find ncbi_dataset -name "*fna" -exec mv {} genomes \;
# Compress the downloaded genomes
find genomes -name "*fna" -exec gzip {} \;
# Alternative: compress with 20 parallel jobs (requires GNU 'parallel', which is not installed by the environment above)
find genomes -name "*fna" | parallel -j 20 'gzip {}'
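After compression, counting the genome files gives a number that can be compared with the genome counts FlexTaxD reports later on. A minimal sketch, where `count_genomes` is a hypothetical helper name:

```shell
# count_genomes: count gzip-compressed FASTA files (*.fna.gz) under a directory.
#   $1 = genome directory (defaults to "genomes")
# (Hypothetical helper for illustration.)
count_genomes() {
    find "${1:-genomes}" -type f -name "*.fna.gz" | wc -l | tr -d ' '
}

# Usage:
#   count_genomes genomes
```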
Alternatively, download only specific taxa:
datasets download genome taxon --assembly-source RefSeq --assembly-level chromosome,complete Francisella --filename francisella.zip
datasets download genome taxon --assembly-source RefSeq --assembly-level chromosome,complete Yersinia --filename yersinia.zip
datasets download genome taxon --assembly-source RefSeq --assembly-level chromosome,complete Escherichia --filename escherichia.zip
The ZIP files are then unpacked and processed in the same way as above.
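One way to handle several archives is sketched below. Each archive unpacks to the same ncbi_dataset directory, so the usage example unpacks every archive into its own directory first. `collect_genomes` is a hypothetical helper name, and the per-archive directory names are assumptions chosen for illustration.

```shell
# collect_genomes: move every *.fna file found under a source directory into
# a destination directory and gzip-compress it there.
#   $1 = unpacked source directory, $2 = destination (defaults to "genomes")
# (Hypothetical helper for illustration.)
collect_genomes() {
    src=$1
    dest=${2:-genomes}
    mkdir -p "$dest"
    find "$src" -type f -name "*.fna" -exec mv {} "$dest" \;
    find "$dest" -type f -name "*.fna" -exec gzip {} \;
}

# Usage: unpack each archive into its own directory, then collect the genomes.
#   for name in francisella yersinia escherichia; do
#       unzip -q "$name.zip" -d "$name"
#       collect_genomes "$name" genomes
#   done
```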
Create the Fdb:
flextaxd --db ncbi.fdb --taxonomy_file nodes.dmp --taxonomy_type NCBI --genomeid2taxid nucl_gb.accession2taxid.gz --genomes_path genomes
Clean the Fdb to remove non-annotated genomes with no downloaded sequences:
# Note: --taxonomy_type NCBI must be specified here; without it, the database clean currently fails with an error (cause not yet identified).
flextaxd --db ncbi.fdb --clean_database --taxonomy_type NCBI
Build the Kraken2 database:
flextaxd-create --db ncbi.fdb --genomes_path genomes --dbprogram kraken2 --create_db --db_name kraken2.ncbi_bacteria_refseq --processes 20
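A finished Kraken2 database normally consists of three files: hash.k2d, opts.k2d and taxo.k2d. The sketch below checks that they exist and are non-empty. It assumes the database was written to a directory named after --db_name; the actual output location should be confirmed before relying on it. `check_kraken2_db` is a hypothetical helper name.

```shell
# check_kraken2_db: verify that a directory holds the three Kraken2 database files.
#   $1 = Kraken2 database directory
# Returns 0 if all files exist and are non-empty, 1 otherwise.
# (Hypothetical helper; the directory name passed in is an assumption.)
check_kraken2_db() {
    rc=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$1/$f" ]; then
            echo "missing or empty: $1/$f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   check_kraken2_db kraken2.ncbi_bacteria_refseq
```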
The NCBI database built above provides general resolution across bacteria. However, specialists may want to refine the taxonomy for a group of interest. Below, the Francisella tularensis node is expanded with a custom taxonomy to increase precision within that group.
Obtain the modification files (taxonomy and genome map):
wget https://github.com/FOI-Bioinformatics/flextaxd/raw/master/wiki/example_data/tularensis/ftd.tree2tax.tul.tsv
wget https://github.com/FOI-Bioinformatics/flextaxd/raw/master/wiki/example_data/tularensis/genomes_map.tul.tsv
Make a copy of the NCBI-database for modification:
cp ncbi.fdb ncbi_tularensis.fdb
Visualise the Francisellaceae sub-tree of the Fdb:
flextaxd --db ncbi_tularensis.fdb --vis_type tree --vis_node Francisellaceae --vis_depth 0 --vis_label_size 7
Figure 1: The NCBI database, showing Francisellaceae-subtree. The tularensis node is indicated by a red box.
Expand the tularensis node with the custom taxonomy:
flextaxd --db ncbi_tularensis.fdb --mod_file ftd.tree2tax.tul.tsv --genomeid2taxid genomes_map.tul.tsv --parent "Francisella tularensis" --replace
Visualise the Francisellaceae sub-tree of the Fdb, after modification:
flextaxd --db ncbi_tularensis.fdb --vis_type tree --vis_node Francisellaceae --vis_depth 0 --vis_label_size 7
Figure 2: The NCBI database, showing Francisellaceae-subtree including the modification of tularensis (indicated by a red box).
Build the Kraken2 database from the modified Fdb:
flextaxd-create --db ncbi_tularensis.fdb --genomes_path genomes --dbprogram kraken2 --create_db --db_name kraken2.ncbi_bac120_tularensis --processes 20
# When prompted, answer "y" to let FlexTaxD download the genomes needed for the tularensis expansion:
>There is a discrepancy of genomes found in the database and the specified genome-folder, 36325 genomes were found and 9 genomes are missing.
>You may want to purge your database from missing genomes using "flextaxd --purge_database"
>Do you want to download these genomes from NCBI? (y)es, (n)o, (c)ancel: y