
2a. File formats

jaclew edited this page Nov 21, 2023 · 7 revisions

File Formats

Below are the currently supported input and output formats. For illustrative examples, please refer to the example data directory.

Supported Input Formats:

  • NCBI
  • TSV
  • MP-Style Taxonomies:
    • GTDB
    • QIIME
    • CanSNPer
    • SILVA-like (SILVA with custom columns)

Supported Database Build Programs (Integrated with FlexTaxD):

  • Kraken2
  • KrakenUniq
  • Ganon

Supported Database Input Formatting (Manual Compilation Required):

  • Kraken2
  • KrakenUniq
  • Ganon
  • Centrifuge
  • Bracken

Format Specifications

FlexTaxD Standard TSV Format (ftd)

A standard FlexTaxD TSV file has two mandatory columns, parent and child, and an optional third column, rank. Columns are separated by tabs, and the file starts with a header row. Each subsequent row defines one parent-child relationship within the taxonomy. The following sample includes the optional rank column:

parent             child             rank
F.tul.             F.tul.tul.        subspecies
F.tul.             F.tul.hol.        subspecies
F.tul.hol.         B.4               subsubspecies
F.tul.hol.         B.6               subsubspecies


Figure 1: Illustration of the tree constructed from the ftd-format example taxonomy above.
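As a rough illustration of how the parent and child columns translate into a tree, the sketch below reads an ftd-style TSV into a parent-to-children mapping. The function name read_ftd and the in-memory sample are assumptions made for this example; this is not FlexTaxD's own loader.

```python
import csv
import io
from collections import defaultdict


def read_ftd(handle):
    """Read an ftd-style TSV and return (children, ranks).

    children maps each parent to its list of child nodes, in file order.
    ranks maps a child to its rank when the optional third column is present.
    Hypothetical helper for illustration, not part of FlexTaxD.
    """
    children = defaultdict(list)
    ranks = {}
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (parent, child[, rank])
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete lines
        parent, child = row[0].strip(), row[1].strip()
        children[parent].append(child)
        if len(row) > 2 and row[2].strip():
            ranks[child] = row[2].strip()
    return dict(children), ranks


# The sample taxonomy from this section, written with real tab characters.
SAMPLE = (
    "parent\tchild\trank\n"
    "F.tul.\tF.tul.tul.\tsubspecies\n"
    "F.tul.\tF.tul.hol.\tsubspecies\n"
    "F.tul.hol.\tB.4\tsubsubspecies\n"
    "F.tul.hol.\tB.6\tsubsubspecies\n"
)

children, ranks = read_ftd(io.StringIO(SAMPLE))
for parent, kids in children.items():
    print(parent, "->", ", ".join(kids))
```

Any node that appears as a parent but never as a child (here F.tul.) is the top of the tree described by the file.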

Genome Annotation File

The genome annotation file has two tab-separated columns: the genome sequence identifier (the file name without its extension) and the genome_id, which in the example below is the name of the taxonomy node the genome belongs to. This file works in conjunction with the ftd format and the other MP-formats that do not contain genome sequence information:

GCF_123456789.1    B.4
GCF_987654321.1    B.6
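A minimal sketch of reading such a file into a lookup table follows. The function name read_genome_annotation is hypothetical, and treating the second column as a node name is an assumption based on the example rows above.

```python
import io


def read_genome_annotation(handle):
    """Map each genome sequence identifier (column 1) to the value in column 2.

    Hypothetical helper for illustration, not part of FlexTaxD.
    """
    genome_to_node = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected two tab-separated columns: {line!r}")
        genome_to_node[fields[0].strip()] = fields[1].strip()
    return genome_to_node


# The two example rows from this section, written with real tab characters.
ANNOTATION = "GCF_123456789.1\tB.4\nGCF_987654321.1\tB.6\n"

genome_to_node = read_genome_annotation(io.StringIO(ANNOTATION))
print(genome_to_node)
```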

MP-Formats

GTDB

In the GTDB format, there are two columns separated by tabs. The first is the genome identifier, and the second is the taxonomy string, in which taxonomic levels are separated by semicolons and each level is marked by a one-letter prefix followed by a double underscore. FlexTaxD allows for an extra level of annotation in QIIME-formatted files, indicated by "x__". For example, a subspecies can be added as an x__ level, as shown in the second row below:

GCF_00005111.1    d__Bacteria;...;s__Francisella tularensis
GCF_00005211.1    d__Bacteria;...;s__Francisella tularensis;x__Francisella_tularensis_tularensis
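To make the prefix convention concrete, the sketch below splits a taxonomy string of this kind into (prefix, name) pairs. The function name parse_taxonomy_string is hypothetical, and the sample string is a shortened illustration that leaves out the intermediate levels; it does not reproduce a real GTDB record.

```python
def parse_taxonomy_string(taxonomy):
    """Split 'd__Bacteria;...;s__Name' style strings into (prefix, name) pairs.

    Hypothetical helper for illustration, not part of FlexTaxD.
    """
    levels = []
    for field in taxonomy.split(";"):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing semicolon
        prefix, sep, name = field.partition("__")
        if not sep:
            raise ValueError(f"level without a '__' prefix: {field!r}")
        levels.append((prefix, name))
    return levels


# Shortened, illustrative taxonomy string (intermediate levels left out).
EXAMPLE = (
    "d__Bacteria;s__Francisella tularensis;"
    "x__Francisella_tularensis_tularensis"
)

levels = parse_taxonomy_string(EXAMPLE)
for prefix, name in levels:
    print(prefix, name)
```

Consecutive pairs in the returned list give the parent-child relationships implied by one row.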

CanSNPer

CanSNPer format uses tabs or semi-colons to separate the taxonomic hierarchy, paired with the genome annotation file for locating genomes:

F.tul.;F.tul.tul.
F.tul.;F.tul.hol.;B.4
F.tul.;F.tul.hol.;B.6
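Each line is a path from the top of the tree down to one node, so the same parent-child pairs can appear on several lines. The sketch below flattens such paths into a list of unique edges; the function name cansnper_edges is hypothetical and the code only illustrates the idea.

```python
import re


def cansnper_edges(lines):
    """Turn path lines (nodes separated by ';' or tabs) into unique edges.

    Returns (parent, child) tuples in first-seen order.
    Hypothetical helper for illustration, not part of FlexTaxD.
    """
    edges = []
    seen = set()
    for line in lines:
        nodes = [node for node in re.split(r"[;\t]", line.strip()) if node]
        # Pair every node with the node that follows it on the path.
        for parent, child in zip(nodes, nodes[1:]):
            if (parent, child) not in seen:
                seen.add((parent, child))
                edges.append((parent, child))
    return edges


PATHS = [
    "F.tul.;F.tul.tul.",
    "F.tul.;F.tul.hol.;B.4",
    "F.tul.;F.tul.hol.;B.6",
]

edges = cansnper_edges(PATHS)
for parent, child in edges:
    print(parent, child, sep="\t")
```

For the three example paths, the resulting edges are the same four parent-child rows as in the ftd sample earlier on this page.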

SILVA

SILVA-like format in FlexTaxD uses semi-colons to separate taxonomy levels, and a tab separates the last taxonomy level from the genome identifier:

Bacteria;Proteobacteria;...;Francisella;    26657    genus    132

Greengenes (To be supported)

The Greengenes format is not yet supported. Format details will be documented with the release that adds support. For reference, Greengenes-style lines look like this:

160827    k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Legionellales;f__Francisellaceae;g__Francisella;
535071    k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Legionellales;f__Francisellaceae;g__Francisella;s__cantonensis;

NCBI Format

The NCBI format is described in detail at the NCBI taxonomy page. It uses a combination of tabs and bar characters (|) to separate the fields of each row:

cellular organisms    |    root                |    no rank
Bacteria              |    cellular organisms  |    kingdom
Archaea               |    cellular organisms  |    kingdom
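A bar-separated row of this kind can be split with a few lines of Python. In the sketch below, the function name parse_bar_line is hypothetical, and reading the three fields as child, parent, and rank is an assumption taken from the example rows above.

```python
def parse_bar_line(line):
    """Split one bar-separated row into a list of stripped fields.

    A trailing bar at the end of the line is tolerated and dropped.
    Hypothetical helper for illustration, not part of FlexTaxD.
    """
    fields = [field.strip() for field in line.strip().split("|")]
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field left by a trailing bar
    return fields


# Example rows from this section; the first one carries a trailing bar.
SAMPLE_LINES = [
    "cellular organisms\t|\troot\t|\tno rank\t|\n",
    "Bacteria\t|\tcellular organisms\t|\tkingdom\n",
]

rows = [parse_bar_line(line) for line in SAMPLE_LINES]
for child, parent, rank in rows:
    print(child, parent, rank, sep=" / ")
```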

Output Formats

  • TSV - Tab-separated values
  • NCBI Format - The first three columns are in standard NCBI format. Additional columns are included for compatibility with tools like Kraken2, Centrifuge, etc.
