Commit 2ec5e56

2.6 readme.txt initial checking
1 parent f624223 commit 2ec5e56

2 files changed

+189
-0
lines changed

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
1+
This page describes the format used to represent the taxonomies that are the inputs and outputs of the Open Tree of Life taxonomy build system.
2+
3+
The format derives from NCBI and is intentionally rudimentary because our needs are minimal. A better format in the long run might be [Darwin Core Archive](https://code.google.com/p/gbif-ecat/wiki/DwCArchive), which is used by GBIF, EOL, and the Global Names Architecture (GNA).
4+
5+
***
6+
7+
Each source taxonomy (NCBI, GBIF, Index Fungorum, ...) has its own script that converts its
8+
native format into this format.
9+
10+
A taxonomy consists of a directory of files with fixed names. Example: `mycobank/taxonomy.tsv`, `mycobank/synonyms.tsv`, `mycobank/about.md`.
11+
12+
## Character encoding
13+
14+
All files use the UTF-8 character encoding. Native taxonomy files often use some other encoding, so conversion might be necessary. Some aggregated taxonomies on the web have gotten this wrong and are a mess of mixed encodings and spurious re-encodings.
15+
16+
## Taxonomy
17+
18+
### File `taxonomy.tsv`
19+
20+
There are four required columns. Each column is followed by tab, vertical bar, tab, even the last column (which is unlike NCBI). The taxonomy build tool `smasher` does not require the vertical bars; they are optional, although they should be either all present or all absent. Some other consumers of these files may still require the vertical bars.
21+
22+
A header row of column names is recommended but is not required by `smasher`. If provided, it looks like:
23+
24+
uid | parent_uid | name | rank |
25+
26+
Each subsequent row describes one taxon.
27+
28+
**Columns:**
29+
30+
1. _identifier_ - an identifier for the taxon, unique within this file. Should be the native accession number whenever possible. Usually this is an integer, but it need not be.
31+
2. _parent taxon identifier_ or the empty string if there is no parent (i.e., it's a root).
32+
3. _name_ - arbitrary text for the taxon name; not necessarily unique within the file.
33+
4. _rank_, e.g. species, family, class. Should be all lower case. If no rank is assigned, or the rank is unknown, put "no rank".
34+
35+
Example (from NCBI):
36+
37+
5157 | 1028423 | Ceratocystis | genus |
38+
5156 | 91171 | Gondwanamyces proteae | species |
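
A reader for this layout could look like the following minimal Python sketch. It assumes the tab, vertical bar, tab delimiter described above, also accepts files without vertical bars, and treats a first row that begins with `uid` as the header. The names `split_row` and `read_taxonomy` are illustrative only and are not part of `smasher`.

```python
DEFAULT_COLUMNS = ["uid", "parent_uid", "name", "rank"]


def split_row(line):
    """Split one row into column values, with or without vertical bars."""
    line = line.rstrip("\r\n")
    if "\t|" in line:
        # Bar form: every column, including the last, is followed by
        # tab-bar-tab. Drop the final terminator (an editor may have
        # stripped its trailing tab), then split on the delimiter.
        if line.endswith("\t|\t"):
            line = line[:-3]
        elif line.endswith("\t|"):
            line = line[:-2]
        return line.split("\t|\t")
    # Bar-less form: plain tab-separated values.
    return line.split("\t")


def read_taxonomy(lines):
    """Return one dict per taxon, keyed by column name."""
    rows = [split_row(line) for line in lines if line.strip()]
    columns = DEFAULT_COLUMNS
    if rows and rows[0][0] == "uid":
        # The header row is optional; use it when present.
        columns = rows.pop(0)
    return [dict(zip(columns, row)) for row in rows]
```

To read a file, open it as UTF-8 and pass the file object, for example `read_taxonomy(open("mycobank/taxonomy.tsv", encoding="utf-8"))`. All values are kept as strings because identifiers need not be integers.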
39+
40+
**Optional additional columns:**
41+
42+
* _sourceinfo_: a comma-separated list of specifiers, each either a URL or a CURIE. A URL should be either a DOI in the form of a URL or a link to some other source, such as a database. URLs begin with 'http://' or 'https://', and DOI URLs begin with 'http://dx.doi.org/10.'. A CURIE is an abbreviated URI that uses a prefix drawn from a known set; e.g., ncbi:1234 is taxon 1234 in the NCBI taxonomy. Other prefixes include gbif:, if: (Index Fungorum), and mb: (Mycobank). New prefixes can be added, but this is a manual process, so please request them explicitly.
43+
* _uniqueName_: a human-readable string that is unique to this taxon, typically the taxon name if it is unique, or taxon name followed by "([rank] in [ancestor])" where rank is the taxon's rank and ancestor is an ancestor that is unique to this taxon (among the taxa that have the same name).
44+
* _flags_: a comma-separated list of flags or markers. Usually these are generated by taxonomy synthesis and are used to decide whether a taxon is 'hidden' or not. For example, if there's an 'extinct' flag then it may be desirable to suppress the taxon in an application. See [here](https://github.com/OpenTreeOfLife/taxomachine/blob/master/src/main/java/org/opentree/taxonomy/OTTFlag.java).
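
The _sourceinfo_ rules above can be sketched in Python as follows. This is only an illustration: `KNOWN_PREFIXES` holds just the four prefixes named in this document (the real set may be larger), the function names are hypothetical, and the sketch assumes that no URL in the list contains a comma.

```python
# Only the prefixes mentioned in this document; the actual set may differ.
KNOWN_PREFIXES = {"ncbi", "gbif", "if", "mb"}


def classify_specifier(spec):
    """Classify one specifier as a DOI URL, another URL, or a CURIE."""
    spec = spec.strip()
    if spec.startswith("http://dx.doi.org/10."):
        return ("doi", spec)
    if spec.startswith(("http://", "https://")):
        return ("url", spec)
    prefix, sep, local_id = spec.partition(":")
    if sep and local_id and prefix in KNOWN_PREFIXES:
        return ("curie", (prefix, local_id))
    # Neither a URL nor a CURIE with a known prefix.
    return ("unknown", spec)


def parse_sourceinfo(value):
    """Split a sourceinfo cell on commas and classify each specifier."""
    return [classify_specifier(s) for s in value.split(",") if s.strip()]
```

For example, `parse_sourceinfo("ncbi:1234,gbif:5678")` yields two `"curie"` entries, one per source taxonomy.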
45+
46+
### Synonyms
47+
48+
Usually there are synonyms. These go into a second file, `synonyms.tsv`. This file must have a header row:
49+
50+
uid | name | type | rank |
51+
52+
The header is necessary because it designates the order of the columns, which can sometimes change. These are the four columns:
53+
54+
* _uid_ - the id for the taxon (from the taxonomy file) that this synonym resolves to
55+
* _name_ - the synonymic taxon name
56+
* _type_ - typically will be 'synonym' but could be any of the NCBI synonym types (authority, common name, etc.)
57+
* _rank_ - currently ignored for taxonomy synthesis.
58+
59+
Example from NCBI:
60+
61+
89373 | Flexibacteraceae | synonym | |
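
Because the column order is given by the header, a reader should look columns up by name rather than by position. The Python sketch below shows one way to do that, assuming the same tab, vertical bar, tab delimiter as `taxonomy.tsv`. The names `split_row` and `read_synonyms` are illustrative, not part of any Open Tree tool.

```python
def split_row(line):
    """Split one row into column values, with or without vertical bars."""
    line = line.rstrip("\r\n")
    if "\t|" in line:
        # Drop the terminator after the last column, then split.
        if line.endswith("\t|\t"):
            line = line[:-3]
        elif line.endswith("\t|"):
            line = line[:-2]
        return line.split("\t|\t")
    return line.split("\t")


def read_synonyms(lines):
    """Map each synonymic name to the list of taxon uids it resolves to."""
    rows = [split_row(line) for line in lines if line.strip()]
    if not rows or "uid" not in rows[0] or "name" not in rows[0]:
        raise ValueError("synonyms.tsv must start with a header row")
    header = rows[0]
    # Find the columns by name, since their order can change.
    uid_col = header.index("uid")
    name_col = header.index("name")
    by_name = {}
    for row in rows[1:]:
        by_name.setdefault(row[name_col], []).append(row[uid_col])
    return by_name
```

The result maps a name to a list because the same synonym may, in principle, resolve to more than one taxon.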
62+
63+
### Metadata
64+
65+
Overall metadata for the taxonomy is placed in a separate file. The metadata format is currently under development. `Smasher` generates this in JSON format as `about.json`, but this file is currently not used programmatically, and is in the process of being overhauled. When generating a taxonomy according to this format in external tools, for now it is best to simply write a markdown or plain text file called `about.md` (in the same directory as `taxonomy.tsv` and `synonyms.tsv`).
66+
67+
The metadata provided in the file should include the source of the taxonomy (article or database) as a URL and any other descriptive information that's available. The purpose of the metadata is not just to describe the taxonomy but also to explain how to check its correctness against its source and how to make corrections and other improvements should the source be updated. When using information from changing sources (databases), the date or dates of retrieval should be recorded.
68+
69+
***
70+
71+
_This page was originally part of the [open tree wiki](https://github.com/OpenTreeOfLife/opentree/wiki/Interim-taxonomy-file-format). It was transferred here on 2014-02-06 and has been maintained here since then._

doc/readme.txt

Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
This is the Open Tree Taxonomy (OTT) version 2.6, created on 2014-04-11.
2+
3+
The taxonomy was generated using the 'smasher' utility, commit
4+
f624223f31, which resides on GitHub here:
5+
https://github.com/OpenTreeOfLife/reference-taxonomy/commit/f624223f31767fa1787f3ba2ddad5daa56fd939b
6+
7+
Files in this package
8+
=====================
9+
10+
ott/taxonomy.tsv
11+
ott/synonyms.tsv
12+
13+
The format of these files is described in doc/Interim-taxonomy-file-format.md
14+
https://github.com/OpenTreeOfLife/reference-taxonomy/blob/master/doc/Interim-taxonomy-file-format.md
15+
16+
ott/hidden.tsv
17+
18+
Report on 'hidden' taxa (incertae sedis and other suppressed taxa).
19+
Columns are OTT id, name, source taxonomy and id, containing major
20+
group, and flags (reasons for hiding).
21+
22+
ott/used-but-hidden.tsv
23+
24+
Subset of hidden.tsv for taxa that are referenced from source trees.
25+
Columns are as for hidden.tsv, with the addition of a column for
26+
Phylografter study number.
27+
28+
ott/conflicts.tsv
29+
30+
Report on taxa that are hidden because they are paraphyletic with
31+
respect to a higher priority taxon. The number at the beginning is
32+
the height, in the taxonomic tree, of the nearest common ancestor
33+
with the priority taxon that 'steals' one or more children.
34+
35+
ott/deprecated.tsv
36+
37+
List of all taxa that have been deprecated since version 2.5.
38+
39+
ott/log.tsv
40+
41+
Additional debugging information related to deprecated taxa.
42+
43+
Inputs required to create this version
44+
======================================
45+
46+
The following information is current as of 2014-05-28.
47+
48+
SILVA
49+
Retrieved from
50+
https://www.arb-silva.de/fileadmin/silva_databases/release_115/Exports/SSURef_NR99_115_tax_silva.fasta.tgz
51+
Last-modified date: 2013-09-07
52+
See: Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P,
53+
Peplies J, Glockner FO (2013) The SILVA ribosomal RNA gene
54+
database project: improved data processing and web-based tools.
55+
Nucleic Acids Research 41 (D1): D590-D596.
56+
http://dx.doi.org/10.1093/nar/gks1219
57+
Web site: https://www.arb-silva.de/
58+
59+
Taxonomy from Hibbett et al 2007, with updates through 2014
60+
Retrieved from
61+
http://dx.doi.org/10.6084/m9.figshare.915439
62+
Last-modified date: 2014-03-10
63+
There is a copy in the git repository.
64+
See: A higher-level phylogenetic classification of the Fungi.
65+
DS Hibbett, M Binder, JF Bischoff, M Blackwell, et al.
66+
Mycological Research 111(5):509-547, 2007.
67+
http://dx.doi.org/10.1016/j.mycres.2007.03.004
68+
69+
Index Fungorum
70+
We received database table dumps from Paul Kirk in email in
71+
November-December 2013, and converted them to the interim taxonomy
72+
format using ad hoc scripts. The converted form is here:
73+
http://purl.org/opentree/ott2.6-inputs/tax/if/taxonomy.tsv
74+
http://purl.org/opentree/ott2.6-inputs/tax/if/synonyms.tsv
75+
Web site: http://www.indexfungorum.org/
76+
77+
Lamiales taxonomy from Schaferhoff et al 2010
78+
File prepared from figure by Open Tree of Life staff.
79+
There is a copy in the git repository.
80+
See:
81+
Schaferhoff, B., Fleischmann, A., Fischer, E., Albach, D. C.,
82+
Borsch, T., Heubl, G., and Muller, K. F. (2010). Towards resolving
83+
Lamiales relationships: insights from rapidly evolving chloroplast
84+
sequences. BMC evolutionary biology 10(1), 352.
85+
http://dx.doi.org/10.1186/1471-2148-10-352
86+
87+
NCBI Taxonomy
88+
Retrieved from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
89+
Last-modified date: 2014-01-06
90+
As far as we can tell, NCBI does not archive past versions of its
91+
taxonomy. We have captured the applicable version here:
92+
http://purl.org/opentree/ott2.6-inputs/feed/ncbi/taxdump.tar.gz
93+
Web site: https://www.ncbi.nlm.nih.gov/taxonomy
94+
95+
GBIF backbone taxonomy
96+
Retrieved from http://ecat-dev.gbif.org/repository/export/checklist1.zip
97+
Last-modified date: 2013-07-02
98+
GBIF intends to reorganize their data archives and this file may
99+
move. We will make a best effort to maintain the following PURL:
100+
http://purl.org/opentree/gbif-backbone-2013-07-02.zip
101+
In case these links do not work, search gbif.org data sets and use the
102+
following file information for confirmation:
103+
Size: 323093992 bytes
104+
sha1sum: b7c7c19f1835af3f424ce4f2c086c692c1818b90
105+
Format: Zip file containing a Darwin Core Archive
106+
Contains file taxon.txt which has 4416348 lines
107+
Web site: http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c
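
The size and sha1sum given above can be checked with a short script. The following Python sketch is one way to do it, using only the standard library. The function names are illustrative, and the path passed to `matches_gbif_backbone` is whatever local file name the archive was saved under.

```python
import hashlib
import os

# Values copied from the file information listed above.
EXPECTED_SIZE = 323093992
EXPECTED_SHA1 = "b7c7c19f1835af3f424ce4f2c086c692c1818b90"


def file_sha1(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_gbif_backbone(path, expected_size=EXPECTED_SIZE,
                          expected_sha1=EXPECTED_SHA1):
    """True if the file at path has the expected size and SHA-1 digest."""
    # Compare the cheap size check first, then hash the contents.
    return (os.path.getsize(path) == expected_size
            and file_sha1(path) == expected_sha1)
```

Reading in chunks keeps memory use small for an archive of this size.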
108+
109+
IRMNG (Interim Register of Marine and Nonmarine Genera)
110+
Retrieved from http://www.cmar.csiro.au/datacentre/downloads/IRMNG_DWC.zip
111+
Last-modified date: 2014-01-12
112+
http://purl.org/opentree/ott2.6-inputs/feed/irmng/in/IRMNG_DWC.zip
113+
Web site: http://www.obis.org.au/irmng/
114+
115+
OTT version 2.5
116+
The previous version of OTT is used only for the purpose of ensuring
117+
identifier choice consistency from one version of OTT to the next.
118+
http://files.opentreeoflife.org/ott/ott2.5.tgz
