2a. File formats
Below are the currently supported input and output formats. For illustrative examples, please refer to the example data directory.
- NCBI
- TSV
- MP-Style Taxonomies:
- GTDB
- QIIME
- CanSNPer
- SILVA-like (SILVA with custom columns)
- Kraken2
- Krakenuniq
- Ganon
- Kraken2
- Krakenuniq
- Ganon
- Centrifuge
- Bracken
A standard TSV file for FlexTaxD includes at least two mandatory columns: parent and child, with an optional third column: rank. These columns are separated by tabs, and the file starts with a header row. Each subsequent row defines a parent-child relationship within the taxonomy. The following is a sample with the optional rank column:
parent child rank
F.tul. F.tul.tul. subspecies
F.tul. F.tul.hol. subspecies
F.tul.hol. B.4 subsubspecies
F.tul.hol. B.6 subsubspecies

Figure 1: Illustration of the tree constructed from the ftd-format example taxonomy above.
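As a rough illustration of how such a table maps onto a tree, the sketch below reads the sample into a parent-to-children dictionary. The function name `read_taxonomy_tsv` is hypothetical and not part of FlexTaxD; the code only assumes the header names `parent`, `child` and `rank` described above.

```python
import csv
import io
from collections import defaultdict


def read_taxonomy_tsv(handle):
    """Return (children, ranks) from a tab-separated parent/child[/rank] table.

    Hypothetical helper for illustration only; not FlexTaxD's own reader.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    children = defaultdict(list)
    ranks = {}
    for row in reader:
        children[row["parent"]].append(row["child"])
        # The rank column is optional, so record it only when present and non-empty.
        if row.get("rank"):
            ranks[row["child"]] = row["rank"]
    return dict(children), ranks


# The sample table from above, with explicit tab separators.
sample = (
    "parent\tchild\trank\n"
    "F.tul.\tF.tul.tul.\tsubspecies\n"
    "F.tul.\tF.tul.hol.\tsubspecies\n"
    "F.tul.hol.\tB.4\tsubsubspecies\n"
    "F.tul.hol.\tB.6\tsubsubspecies\n"
)
children, ranks = read_taxonomy_tsv(io.StringIO(sample))
```

Here `children["F.tul.hol."]` holds the two leaves `B.4` and `B.6`, which matches the branching shown in Figure 1.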
The genome annotation file format includes two columns, separated by tabs: the genome sequence identifier (without file extension) and the genome name. This file works in conjunction with the ftd-format and other MP-formats that do not contain genome sequence information:
GCF_123456789.1 B.4
GCF_987654321.1 B.6
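A minimal sketch of reading this file into a lookup table follows. It assumes exactly two tab-separated columns and no header row, as in the example above; `read_genome_annotation` is a hypothetical name, not a FlexTaxD function.

```python
def read_genome_annotation(lines):
    """Map each genome sequence identifier to the taxonomy node it belongs to.

    Hypothetical helper; assumes two tab-separated columns and no header row.
    """
    genome_to_node = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        genome_id, node_name = line.split("\t", 1)
        genome_to_node[genome_id] = node_name
    return genome_to_node


annotation = read_genome_annotation([
    "GCF_123456789.1\tB.4\n",
    "GCF_987654321.1\tB.6\n",
])
```

The resulting dictionary links each genome to a node name that must also appear in the taxonomy file.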
In the GTDB format, there are two columns separated by tabs. The first is the genome identifier, and the second is the taxonomy string, in which taxonomic levels are separated by semicolons and each level is indicated by a prefix followed by a double underscore. FlexTaxD also allows an extra level of annotation in QIIME-formatted files, indicated by the prefix "x__" (for example, a subspecies), as shown in the second row below:
GCF_00005111.1 d__Bacteria;...;s__Francisella tularensis
GCF_00005211.1 d__Bacteria;...;s__Francisella tularensis;x__Francisella_tularensis_tularensis
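The structure of such a row can be sketched as follows. The parser below is an illustration under the assumptions stated above (tab between the two columns, semicolons between levels, `__` after each prefix); `parse_gtdb_line` is a hypothetical name. The elided `...` from the example is kept as-is and simply skipped, because it carries no prefix.

```python
def parse_gtdb_line(line):
    """Split one GTDB-style row into (genome_id, [(prefix, name), ...]).

    Hypothetical helper for illustration; not FlexTaxD's own parser.
    """
    genome_id, taxonomy = line.rstrip("\n").split("\t", 1)
    levels = []
    for field in taxonomy.split(";"):
        if "__" not in field:
            continue  # skip elided or empty fields such as "..."
        prefix, name = field.split("__", 1)
        levels.append((prefix, name))
    return genome_id, levels


genome_id, levels = parse_gtdb_line(
    "GCF_00005211.1\td__Bacteria;...;s__Francisella tularensis;"
    "x__Francisella_tularensis_tularensis"
)
```

The last entry in `levels` carries the `x` prefix, which is the extra annotation level described above.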
CanSNPer format uses tabs or semi-colons to separate the taxonomic hierarchy, paired with the genome annotation file for locating genomes:
F.tul.;F.tul.tul.
F.tul.;F.tul.hol.;B.4
F.tul.;F.tul.hol.;B.6
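Each line is a path from the root to a node, so the parent-child relationships can be recovered by pairing neighbouring entries. The sketch below shows one way to do that; `cansnper_edges` is a hypothetical name, and the code assumes node names contain neither tabs nor semicolons.

```python
import re


def cansnper_edges(lines):
    """Turn root-to-node paths into a de-duplicated list of (parent, child) edges.

    Hypothetical helper for illustration; not part of FlexTaxD or CanSNPer.
    """
    edges = []
    seen = set()
    for line in lines:
        # Accept either tabs or semicolons as separators and drop empty entries.
        nodes = [n for n in re.split(r"[;\t]", line.strip()) if n]
        for parent, child in zip(nodes, nodes[1:]):
            if (parent, child) not in seen:
                seen.add((parent, child))
                edges.append((parent, child))
    return edges


edges = cansnper_edges([
    "F.tul.;F.tul.tul.",
    "F.tul.;F.tul.hol.;B.4",
    "F.tul.;F.tul.hol.;B.6",
])
```

For the three example paths this yields four edges, the same relationships as in the TSV sample earlier on this page.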
SILVA-like format in FlexTaxD uses semi-colons to separate taxonomy levels, and a tab separates the last taxonomy level from the genome identifier:
Bacteria;Proteobacteria;...;Francisella; 26657 genus 132
Details of the Greengenes format will be added to this documentation with the release that supports it:
160827 \tk__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Legionellales;f__Francisellaceae;g__Francisella;
535071 \tk__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Legionellales;f__Francisellaceae;g__Francisella;s__cantonensis;
The NCBI format is described in detail at the NCBI taxonomy page. It typically uses a combination of tabs and bar characters (|) to separate the columns:
cellular organisms | root | no rank
Bacteria | cellular organisms | kingdom
Archaea | cellular organisms | kingdom
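A small sketch of splitting such rows into columns is given below. It assumes the bar character is the field separator with optional surrounding whitespace, and tolerates the trailing bar that ends each row in NCBI taxonomy dump files. `split_dmp_line` is a hypothetical name, and the meaning of each column is left to the caller.

```python
def split_dmp_line(line):
    """Split one bar-delimited row into a list of stripped column values.

    Hypothetical helper; it does not interpret what the columns mean.
    """
    line = line.rstrip()
    # Rows in NCBI dump files end with a trailing bar; drop it if present.
    if line.endswith("|"):
        line = line[:-1]
    return [field.strip() for field in line.split("|")]


rows = [
    split_dmp_line("cellular organisms\t|\troot\t|\tno rank\t|\n"),
    split_dmp_line("Bacteria | cellular organisms | kingdom"),
]
```

Both rows come back as three-element lists, whether or not the trailing bar is present.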
- TSV - Tab-separated values
- NCBI Format - The first three columns are in standard NCBI format. Additional columns are included for compatibility with tools like Kraken2, Centrifuge, etc.