Commit 8b88819

committed
update
1 parent 26f7cfa commit 8b88819

3 files changed: +178 -29 lines changed

README.md

Lines changed: 33 additions & 28 deletions
@@ -1,18 +1,19 @@
11
# Background
22

3-
Unambiguously identifying species is a far more challenging task than it may
4-
appear. There are a vast number of reasons for this. Different databases keep
5-
different taxonomic "backbones", *i.e.* different data structures in which names
6-
are mapped to species, and organised in a hierarchy. Not all names are unique
7-
identifiers to groups. For example, *Io* can either refer to a genus of plants
8-
from the aster family, or to a genus of molluscs; the genus *Mus* (of which the
9-
house mouse *Mus musculus* is a species), contains a sub-genus *also* named
10-
*Mus* (within which *Mus musculus* is located). Conversely, the same species can
11-
have several names, which are valid synonyms: for example, the domestic cow *Bos
12-
taurus* admits *Bos primigenius taurus* as a valid synonym. In addition to
13-
binomial names, the same species can be known by many vernacular (common) names,
14-
which are language or even region-specific: *Ovis aries*, for example, has valid
15-
English vernaculars including lamb, sheep, wild sheep, and domestic sheep.
3+
Unambiguously identifying species names in text is a far more challenging task
4+
than it may appear. There are a vast number of reasons for this. Different
5+
databases keep different taxonomic "backbones", *i.e.* different data structures
6+
in which names are mapped to species, and organised in a hierarchy. Not all
7+
names are unique identifiers to groups. For example, *Io* can either refer to a
8+
genus of plants from the aster family, or to a genus of molluscs; the genus
9+
*Mus* (of which the house mouse *Mus musculus* is a species), contains a
10+
sub-genus *also* named *Mus* (within which *Mus musculus* is located).
11+
Conversely, the same species can have several names, which are valid synonyms:
12+
for example, the domestic cow *Bos taurus* admits *Bos primigenius taurus* as a
13+
valid synonym. In addition to binomial names, the same species can be known by
14+
many vernacular (common) names, which are language or even region-specific:
15+
*Ovis aries*, for example, has valid English vernaculars including lamb, sheep,
16+
wild sheep, and domestic sheep.
1617

1718
In addition, taxonomic nomenclature changes regularly, with groups being split,
1819
merged, or moved to a new position in the tree of life; often, taxonomic
@@ -23,7 +24,7 @@ differ markedly from the last; compare, *e.g* @Lefkowitz2018VirTax to
2324
created within just two years. As a consequence any mapping of names to other
2425
biological entities can become outdated, and therefore invalid. These taxonomic
2526
changes have profound implications for the way we perceive biodiversity at
26-
global scales [@Dikow2009BioRes], to the point were taxonomic revisions should
27+
global scales [@Dikow2009BioRes], to the point where taxonomic revisions should
2728
sometimes be actively conducted to improve *e.g.* conservation outcomes
2829
[@Melville2021RetApp].
2930

@@ -42,7 +43,7 @@ more names), like plant census [@Dauncey2016ComMis; @Wagner2016RevSof;
4243
knowledge of the taxonomy; and as a result of the estimated error in any data
4344
entry exercice, which other fields estimate at up to about 5%
4445
[@Barchard2011PreHum]. As a result, the first question one needs to ask when
45-
confronted with a string of character that purportedly points to a node in the
46+
confronted with a string of characters that purportedly points to a node in the
4647
tree of life is not "to which entry in the taxonomy database is it associated?",
4748
but "is there a mistake in this name that is likely to render a simple lookup
4849
invalid?".
@@ -52,7 +53,7 @@ within and across datasets. Let us consider the hypothetical species survey of
5253
riverine fishes: European chub, *Cyprinus cephalus*, *Leuciscus cephalus*,
5354
*Squalius cephalus*. All are the same species (*S. cephalus*), referred to as
5455
one of the vernacular (European chub) and two formerly accepted names now
55-
classified as synonyms (but still present in the litterature). A simple estimate
56+
classified as synonyms (but still present in the literature). A simple estimate
5657
of diversity based on the user-supplied names would give $n=4$ species, when
5758
there is in fact only one. Some cases can be more difficult to catch; for
5859
example, the species *Isoetes minima* is frequently mentionned as *Isœtes
@@ -66,8 +67,12 @@ decreases.
6667

6768
In this manuscript, we describe `NCBITaxonomy.jl`, a Julia package that provides
6869
advanced name matching and error handling capacities for the reconciliation of
69-
taxonomic names to the NCBI database. This package was used to facilitate the
70-
development of the *CLOVER* [@Gibb2021DatPro] database of host-virus
70+
taxonomic names to the NCBI database. This package works by downloading a local
71+
copy of the taxonomy database, so that queries can be made rapidly, and that
72+
subsequent queries will return the same results. The package offers
73+
functionalities to automatically prompt users to update the local copy of the
74+
taxononmy database if it becomes outdated. This package was used to facilitate
75+
the development of the *CLOVER* [@Gibb2021DatPro] database of host-virus
7176
associations, by reconciling the names of viruses and mammals from four
7277
different sources, where all of the issues described above were present. More
7378
recently, it has become part of the automated curation of data for the *VIRION*
@@ -81,7 +86,7 @@ high-performance name reconciliation.
8186
Based on the author's experience reconciling lists of thousands of biological
8287
names, `NCBITaxonomy.jl` is built around a series of features that allow (i)
8388
maximum flexibility when handling names without a direct match, (ii) a bespoke
84-
exception system to handle failures to match automatically, and (ii) limits to
89+
exception system to handle failures to match automatically, and (iii) limits to
8590
the pool of potential names in order to achieve orders-of-magnitude speedups
8691
when the broad classification of the name to match is known. Adhering to these
8792
design principles led to a number of choices. A comparison of the features of
@@ -101,7 +106,7 @@ be called only after a case-sensitive, non-fuzzy search yields an exception
101106
about the lack of a direct match. Finally, in order to achieve a good
102107
performance even when relying on fuzzy matching, we offer the ability to limit
103108
the search to specific parts of the taxonomy database. An example of the impact
104-
of this feature on the performance of the package is presented below.
109+
of this feature on the performance of the package is presented in Table 1.
105110

106111
| Tool | Lang. | Library | CLI | Local DB | Fuzzy | Case | Subsets | Ranks | Reference |
107112
| ----------------- | -------- | :-----: | :---: | :------: | :---: | :---: | :-----: | :---: | --------- |
@@ -118,7 +123,7 @@ work as a command-line tool. "Local DB": ability to store a copy of the database
118123
locally. "Fuzzy": ability to perform fuzzy matching on inputs. "Case": ability
119124
to perform case-insensitive search. "Subsets": ability to limit the search to a
120125
subset of the raw database. "Ranks": ability to limit the search to specific
121-
raxonomi ranks. The features of the various packages have been determined from
126+
taxonomic ranks. The features of the various packages have been determined from
122127
reading their documentation. {@tbl:id}
123128

124129
An up-to-date version of the documentation for `NCBITaxonomy.jl` can be found in
@@ -184,7 +189,7 @@ in order to create a name finder function (see the next section). The `taxon`
184189
method has additional arguments to perform fuzzy matching in order to catch
185190
possible typos (`taxon("Boops bops"; strict=false)`), to perform a lowercase
186191
search (useful when alphanumeric codes are part of the taxon name, like for some
187-
viruses), and to restrict the the search to a specific taxonomic rank. The
192+
viruses), and to restrict the search to a specific taxonomic rank. The
188193
`taxon` function also accepts a `preferscientificname` keyword, to prevent
189194
matching vernacular names; the use of this keyword ought to be informed by
190195
knowledge about how the data were entered.
@@ -218,7 +223,7 @@ When it succeeds, `taxon` will return a `NCBITaxon` object (made of a `name`
218223
string field, and an `id` numerical field). That being said, the package is
219224
designed under the assumption that ambiguities should yield an error for the
220225
user to handle. There are two such errors: `NameHasNoDirectMatch` (with
221-
instructions about how to possible solve it, using the `similarnames` function),
226+
instructions about how to possibly solve it, using the `similarnames` function),
222227
or a `NameHasMultipleMatches` (listing the possible valid matches, and
223228
suggesting to use `alternativetaxa` to find the correct one). Therefore, the
224229
common way to work with the `taxon` function would be to wrap it in a
@@ -249,12 +254,12 @@ raccoon.
249254

250255
## Name filtering functions
251256

252-
As the full NCBI names table has over 3 million entries at the time of writing,
253-
we have provided a number of functions to restrict the scope of names that are
254-
searched. These are driven by the NCBI *divisions*. For example `nf =
257+
As the full NCBI names table holds over 3 million entries at the time of
258+
writing, we have provided a number of functions to restrict the scope of names
259+
that are searched. These are driven by the NCBI *divisions*. For example `nf =
255260
mammalfilter(true)` will return a data frame containing the names of mammals,
256261
inclusive of rodents and primates, and can be used with *e.g.* `taxon(nf,
257-
"Pan")`. This has the dual advantage of making search faster, but also of
262+
"Pan")`. This has the dual advantage of making queries faster, but also of
258263
avoiding matching on names that are shared by another taxonomic group (which is
259264
not an issue with *Pan*, but is an issue with *e.g.* *Io* as mentioned in the
260265
introduction, or with the common name *Lizard*, which fuzzy-matches on the
@@ -274,7 +279,7 @@ the entire database, in all mammals, and in all primates:
274279
| | yes | 0.3 | 92 | 27 KiB |
275280

276281
Clearly, the optimal search strategy is to (i) rely on name filters to ensure
277-
that search are conducted within the appropriate NCBI division, and (ii) only
282+
that searches are conducted within the appropriate NCBI division, and (ii) only
278283
rely on fuzzy matching when the strict or lowercase match fails to return a
279284
name, as fuzzy matching can result in order of magnitude more run time and
280285
memory footprint. These numbers were obtained on a single Intel i7-8665U CPU (@
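
The search strategy recommended in the README hunk above (filter by NCBI division first, try a strict match, and fall back to fuzzy matching only on failure) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the helper `reconcile` is hypothetical, and passing `strict=false` together with a name filter is an assumption that combines two call forms the text shows separately. It also assumes `NCBITaxonomy.jl` is installed and a local copy of the taxonomy has been downloaded.

```julia
using NCBITaxonomy  # assumes the package and its local taxonomy copy are available

# Restrict the pool of candidate names to mammals, as described in the text.
nf = mammalfilter(true)

# Hypothetical helper: strict lookup first, fuzzy lookup only if that fails.
function reconcile(names_table, name)
    try
        # Strict match within the filtered names table.
        return taxon(names_table, name)
    catch err
        if err isa NameHasNoDirectMatch
            # Assumption: the filtered form of `taxon` accepts `strict=false`.
            # The fuzzy lookup may itself throw, which is left to the caller.
            return taxon(names_table, name; strict=false)
        elseif err isa NameHasMultipleMatches
            # Ambiguous names are flagged for manual curation rather than guessed.
            return nothing
        else
            rethrow()
        end
    end
end

reconcile(nf, "Pan")
```

Returning `nothing` for ambiguous names is one possible design choice; it keeps the decision with the user, in line with the stated principle that ambiguities should yield an error for the user to handle.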

metadata.json

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@
4242
"orcid": "0000-0001-6960-8434"
4343
}
4444
],
45-
"abstract": "`NCBITaxonomy.jl` is a package designed to facilitate the reconciliation and cleaning of taxonomic names, using a local copy of the NCBI taxonomic backbone [@Federhen2012NcbTax; @Schoch2020NcbTax]; The basic search functions are coupled with quality-of-life functions including case-insensitive search and custom fuzzy string matching to facilitate the amount of information that can be extracted automatically while allowing efficient manual curation and inspection of results. `NCBITaxonomy.jl` works with version 1.6 of the Julia programming language [@Bezanson2017JulFre], and relies on the Apache Arrow format to store a local copy of the NCBI raw taxonomy files. The design of `NCBITaxonomy.jl` has been inspired by similar efforts, like the R package `taxadb` [@Norman2020TaxHig], which provides an offline alternative to packages like `taxize` [@Chamberlain2013TaxTax].",
45+
"abstract": "",
4646
"keywords": [
4747
"biodiversity",
4848
"taxonomy",

reviews.md

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
1+
# Reviewer 1
2+
3+
This paper describes a Julia package for identifying and standardizing species
4+
names in text, with the purpose of ensuring that species are not over-counted,
5+
mis-identified, or misunderstood. While I have not been able to check the
6+
software, I note that the code and its repository looks nicely engineered and
7+
that it is in use in more than one system setup by the authors. While not a new
8+
concept, the novelty with the software is its combination of features.
9+
10+
> We appreciate the feedback by the reviewer, and have addressed all of their comments in the revision.
11+
12+
My general impression of the paper is that it describes useful software but that
13+
the text needs work. It seems the abstract and background has been given far
14+
less attention than the rest of the paper. I am confident that more readers will
15+
be found if those parts are reworked.
16+
17+
> We have updated the abstract, and clarified the background section of the
18+
> manuscript, notably in the last paragraph. We hope that this will help readers
19+
> understand the purpose of the package.
20+
21+
# Specific comments
22+
23+
## Abstract
24+
25+
I don't think "the NCBI taxonomic backbone" is an established term and, as such, should not be used in the abstract. When googling, at least the top three hits are to the authors' own papers and a preprint of the present manuscript.
26+
27+
> Corrected as part of the abstract changes
28+
29+
I have been programming my whole life and I struggle with the following sentence: "The basic search functions are coupled with quality-of-life functions including case-insensitive search and custom fuzzy string matching to facilitate the amount of information that can be extracted automatically while allowing efficient manual curation and inspection of results." In particular, "quality-of-life functions" and "custom fuzzy string matching" is not helpful for anyone curious about your work.
30+
31+
> Corrected as part of the abstract changes. We now provide longer list of
32+
> functionalities.
33+
34+
Is relying on the Apache Arrow format a dependency or a feature? If it is an implementation detail, I would say it does not belong in the abstract.
35+
36+
> Both - we have kept it in the abstract as it allows high-performance access to
37+
> the data.
38+
39+
The abstract does not speak to a broader public. What are the applications of
40+
the software? The phrase "to facilitate the reconciliation and cleaning of
41+
taxonomic names" probably only makes sense to a quite narrow audience.
42+
43+
> We have clarified the list of common issues in taxonomic names that the
44+
> software is intended to correct.
45+
46+
## Background
47+
48+
The first paragraph reads as a copy of the abstract.
49+
50+
> The first paragraph has been reworked.
51+
52+
Please define "the NCBI taxonomic backbone" before use.
53+
54+
> We define taxonomic backbones in the first paragraph.
55+
56+
"Unambiguously identifying species" should be "Unambiguously identifying species names in text".
57+
58+
> Thank you for the suggestion, fixed.
59+
60+
Avoid "presented below". Write "presented in Table 1" instead. You cannot assume the table in print ends up where you expect it.
61+
62+
> Fixed.
63+
64+
I note that Table 1 has a column "Reference", which is good, but it is empty.
65+
66+
> TODO
67+
68+
## Language
69+
70+
Opening your submitted file in Word, I get spell and grammar warnings on quite trivial mistakes, for example "occuring", "litterature", "to the point were", and more.
71+
72+
> These have been fixed
73+
74+
I also note simple mistakes that are hard for Word to notice: "a string of character"
75+
76+
> These have been fixed
77+
78+
## Code and code access
79+
80+
I am not a Julia user, but from a general (programming language agnostic) standpoint it looks like well-structured code.
81+
82+
> Thank you.
83+
84+
The Zenodo page is either not existing or it is not accessible to the public.
85+
86+
> TODO
87+
88+
The GitHub repository is acessible. It is also setup for and invites for collaboration. There are no instructions for how to install and get started, from what I can find. How much of a Julia user does one need to try this package out? It would be nice with some basic install and get-started instructions. That is extra work I do not want to demand, but it would certainly help with "pickup" of users. For example, I have text would be curious to test your package on, so what would I do?
89+
90+
> TODO
91+
92+
It does not strike me as important to have details about error handling in the article. It is good programming and it should be boasted as a feature, but such programming details belongs in the package documentation (or README if you want to make it more public), in my humble opinion.
93+
94+
> TODO
95+
96+
# Reviewer 2
97+
98+
Poisot and colleagues present their software package aimed at making taxonomic classifications/searches more efficient on a local copy of the NCBI database. This appears to be a useful tool that the bioinformatics community will appreciate.
99+
100+
## Minor editorial comments:
101+
102+
P3: improve what? Maybe ‘improve classifications such as conservation outcomes’.
103+
104+
> Clarified as part of changes to the background section.
105+
106+
P4: should be ‘string of characters’.
107+
108+
> Fixed.
109+
110+
P4: literature is misspelled.
111+
112+
> Fixed
113+
114+
P4: there are ‘(ii)’s, one of them should be ‘(iii)’.
115+
116+
> Fixed
117+
118+
P5: what is a ‘raxonomi’?
119+
120+
> Fixed
121+
122+
P6: ‘the the’ is incorrect.
123+
124+
> Fixed
125+
126+
P7: ‘possible’ should be ‘possibly’.
127+
128+
> Fixed
129+
130+
P7: ‘table has’ should be ‘table currently has’.
131+
132+
> Not fixed, "at the time of writing" is specified immediately after in the sentence
133+
134+
P7: omit ‘at the time of writing’.
135+
136+
> Not fixed
137+
138+
P7: replace ‘search faster’ with ‘searches faster’.
139+
140+
> Fixed
141+
142+
P7: replace ‘search are’ with ‘searches are’
143+
144+
> Fixed
