update

tpoisot · tpoisot · commit 8b88819ef127 · 2025-06-30T13:17:49.000-04:00
diff --git a/README.md b/README.md
@@ -1,18 +1,19 @@
 # Background
 
-Unambiguously identifying species is a far more challenging task than it may
-appear. There are a vast number of reasons for this. Different databases keep
-different taxonomic "backbones", *i.e.* different data structures in which names
-are mapped to species, and organised in a hierarchy. Not all names are unique
-identifiers to groups. For example, *Io* can either refer to a genus of plants
-from the aster family, or to a genus of molluscs; the genus *Mus* (of which the
-house mouse *Mus musculus* is a species), contains a sub-genus *also* named
-*Mus* (within which *Mus musculus* is located). Conversely, the same species can
-have several names, which are valid synonyms: for example, the domestic cow *Bos
-taurus* admits *Bos primigenius taurus* as a valid synonym. In addition to
-binomial names, the same species can be known by many vernacular (common) names,
-which are language or even region-specific: *Ovis aries*, for example, has valid
-English vernaculars including lamb, sheep, wild sheep, and domestic sheep.
+Unambiguously identifying species names in text is a far more challenging task
+than it may appear. There are a vast number of reasons for this. Different
+databases keep different taxonomic "backbones", *i.e.* different data structures
+in which names are mapped to species, and organised in a hierarchy. Not all
+names are unique identifiers to groups. For example, *Io* can either refer to a
+genus of plants from the aster family, or to a genus of molluscs; the genus
+*Mus* (of which the house mouse *Mus musculus* is a species), contains a
+sub-genus *also* named *Mus* (within which *Mus musculus* is located).
+Conversely, the same species can have several names, which are valid synonyms:
+for example, the domestic cow *Bos taurus* admits *Bos primigenius taurus* as a
+valid synonym. In addition to binomial names, the same species can be known by
+many vernacular (common) names, which are language or even region-specific:
+*Ovis aries*, for example, has valid English vernaculars including lamb, sheep,
+wild sheep, and domestic sheep.
 
 In addition, taxonomic nomenclature changes regularly, with groups being split,
 merged, or moved to a new position in the tree of life; often, taxonomic
@@ -23,7 +24,7 @@ differ markedly from the last; compare, *e.g* @Lefkowitz2018VirTax to
 created within just two years. As a consequence any mapping of names to other
 biological entities can become outdated, and therefore invalid. These taxonomic
 changes have profound implications for the way we perceive biodiversity at
-global scales [@Dikow2009BioRes], to the point were taxonomic revisions should
+global scales [@Dikow2009BioRes], to the point where taxonomic revisions should
 sometimes be actively conducted to improve *e.g.* conservation outcomes
 [@Melville2021RetApp].
 
@@ -42,7 +43,7 @@ more names), like plant census [@Dauncey2016ComMis; @Wagner2016RevSof;
 knowledge of the taxonomy; and as a result of the estimated error in any data
 entry exercice, which other fields estimate at up to about 5%
 [@Barchard2011PreHum]. As a result, the first question one needs to ask when
-confronted with a string of character that purportedly points to a node in the
+confronted with a string of characters that purportedly points to a node in the
 tree of life is not "to which entry in the taxonomy database is it associated?",
 but "is there a mistake in this name that is likely to render a simple lookup
 invalid?".
@@ -52,7 +53,7 @@ within and across datasets. Let us consider the hypothetical species survey of
 riverine fishes: European chub, *Cyprinus cephalus*, *Leuciscus cephalus*,
 *Squalius cephalus*. All are the same species (*S. cephalus*), referred to as
 one of the vernacular (European chub) and two formerly accepted names now
-classified as synonyms (but still present in the litterature). A simple estimate
+classified as synonyms (but still present in the literature). A simple estimate
 of diversity based on the user-supplied names would give $n=4$ species, when
 there is in fact only one. Some cases can be more difficult to catch; for
 example, the species *Isoetes minima* is frequently mentionned as *Isœtes
@@ -66,8 +67,12 @@ decreases.
 
 In this manuscript, we describe `NCBITaxonomy.jl`, a Julia package that provides
 advanced name matching and error handling capacities for the reconciliation of
-taxonomic names to the NCBI database. This package was used to facilitate the
-development of the *CLOVER* [@Gibb2021DatPro] database of host-virus
+taxonomic names to the NCBI database. This package works by downloading a local
+copy of the taxonomy database, so that queries can be made rapidly, and that
+subsequent queries will return the same results. The package offers
+functionalities to automatically prompt users to update the local copy of the
+taxonomy database if it becomes outdated. This package was used to facilitate
+the development of the *CLOVER* [@Gibb2021DatPro] database of host-virus
 associations, by reconciling the names of viruses and mammals from four
 different sources, where all of the issues described above were present. More
 recently, it has become part of the automated curation of data for the *VIRION*
@@ -81,7 +86,7 @@ high-performance name reconciliation.
 Based on the author's experience reconciling lists of thousands of biological
 names, `NCBITaxonomy.jl` is built around a series of features that allow (i)
 maximum flexibility when handling names without a direct match, (ii) a bespoke
-exception system to handle failures to match automatically, and (ii) limits to
+exception system to handle failures to match automatically, and (iii) limits to
 the pool of potential names in order to achieve orders-of-magnitude speedups
 when the broad classification of the name to match is known. Adhering to these
 design principles led to a number of choices. A comparison of the features of
@@ -101,7 +106,7 @@ be called only after a case-sensitive, non-fuzzy search yields an exception
 about the lack of a direct match. Finally, in order to achieve a good
 performance even when relying on fuzzy matching, we offer the ability to limit
 the search to specific parts of the taxonomy database. An example of the impact
-of this feature on the performance of the package is presented below.
+of this feature on the performance of the package is presented in Table 1.
 
 | Tool              | Lang.    | Library |  CLI  | Local DB | Fuzzy | Case  | Subsets | Ranks | Reference |
 | ----------------- | -------- | :-----: | :---: | :------: | :---: | :---: | :-----: | :---: | --------- |
@@ -118,7 +123,7 @@ work as a command-line tool. "Local DB": ability to store a copy of the database
 locally. "Fuzzy": ability to perform fuzzy matching on inputs. "Case": ability
 to perform case-insensitive search. "Subsets": ability to limit the search to a
 subset of the raw database. "Ranks": ability to limit the search to specific
-raxonomi ranks. The features of the various packages have been determined from
+taxonomic ranks. The features of the various packages have been determined from
 reading their documentation. {@tbl:id}
 
 An up-to-date version of the documentation for `NCBITaxonomy.jl` can be found in
@@ -184,7 +189,7 @@ in order to create a name finder function (see the next section). The `taxon`
 method has additional arguments to perform fuzzy matching in order to catch
 possible typos (`taxon("Boops bops"; strict=false)`), to perform a lowercase
 search (useful when alphanumeric codes are part of the taxon name, like for some
-viruses), and to restrict the the search to a specific taxonomic rank. The
+viruses), and to restrict the search to a specific taxonomic rank. The
 `taxon` function also accepts a `preferscientificname` keyword, to prevent
 matching vernacular names; the use of this keyword ought to be informed by
 knowledge about how the data were entered.
@@ -218,7 +223,7 @@ When it succeeds, `taxon` will return a `NCBITaxon` object (made of a `name`
 string field, and an `id` numerical field). That being said, the package is
 designed under the assumption that ambiguities should yield an error for the
 user to handle. There are two such errors: `NameHasNoDirectMatch` (with
-instructions about how to possible solve it, using the `similarnames` function),
+instructions about how to possibly solve it, using the `similarnames` function),
 or a `NameHasMultipleMatches` (listing the possible valid matches, and
 suggesting to use `alternativetaxa` to find the correct one). Therefore, the
 common way to work with the `taxon` function would be to wrap it in a
@@ -249,12 +254,12 @@ raccoon.
 
 ## Name filtering functions
 
-As the full NCBI names table has over 3 million entries at the time of writing,
-we have provided a number of functions to restrict the scope of names that are
-searched. These are driven by the NCBI *divisions*. For example `nf =
+As the full NCBI names table holds over 3 million entries at the time of
+writing, we have provided a number of functions to restrict the scope of names
+that are searched. These are driven by the NCBI *divisions*. For example `nf =
 mammalfilter(true)` will return a data frame containing the names of mammals,
 inclusive of rodents and primates, and can be used with *e.g.* `taxon(nf,
-"Pan")`. This has the dual advantage of making search faster, but also of
+"Pan")`. This has the dual advantage of making queries faster, but also of
 avoiding matching on names that are shared by another taxonomic group (which is
 not an issue with *Pan*, but is an issue with *e.g.* *Io* as mentioned in the
 introduction, or with the common name *Lizard*, which fuzzy-matches on the
@@ -274,7 +279,7 @@ the entire database, in all mammals, and in all primates:
 |                      |      yes       | 0.3       | 92          | 27 KiB           |
 
 Clearly, the optimal search strategy is to (i) rely on name filters to ensure
-that search are conducted within the appropriate NCBI division, and (ii) only
+that searches are conducted within the appropriate NCBI division, and (ii) only
 rely on fuzzy matching when the strict or lowercase match fails to return a
 name, as fuzzy matching can result in order of magnitude more run time and
 memory footprint. These numbers were obtained on a single Intel i7-8665U CPU (@
diff --git a/metadata.json b/metadata.json
@@ -42,7 +42,7 @@
       "orcid": "0000-0001-6960-8434"
     }
   ],
-  "abstract": "`NCBITaxonomy.jl` is a package designed to facilitate the reconciliation and cleaning of taxonomic names, using a local copy of the NCBI taxonomic backbone [@Federhen2012NcbTax; @Schoch2020NcbTax]; The basic search functions are coupled with quality-of-life functions including case-insensitive search and custom fuzzy string matching to facilitate the amount of information that can be extracted automatically while allowing efficient manual curation and inspection of results. `NCBITaxonomy.jl` works with version 1.6 of the Julia programming language [@Bezanson2017JulFre], and relies on the Apache Arrow format to store a local copy of the NCBI raw taxonomy files. The design of `NCBITaxonomy.jl` has been inspired by similar efforts, like the R package `taxadb` [@Norman2020TaxHig], which provides an offline alternative to packages like `taxize` [@Chamberlain2013TaxTax].",
+  "abstract": "",
   "keywords": [
     "biodiversity",
     "taxonomy",
diff --git a/reviews.md b/reviews.md
@@ -0,0 +1,144 @@
+# Reviewer 1
+
+This paper describes a Julia package for identifying and standardizing species
+names in text, with the purpose of ensuring that species are not over-counted,
+mis-identified, or misunderstood. While I have not been able to check the
+software, I note that the code and its repository looks nicely engineered and
+that it is in use in more than one system setup by the authors. While not a new
+concept, the novelty with the software is its combination of features.
+
+> We appreciate the feedback by the reviewer, and have addressed all of their comments in the revision.
+
+My general impression of the paper is that it describes useful software but that
+the text needs work. It seems the abstract and background has been given far
+less attention than the rest of the paper. I am confident that more readers will
+be found if those parts are reworked.
+
+> We have updated the abstract, and clarified the background section of the
+> manuscript, notably in the last paragraph. We hope that this will help readers
+> understand the purpose of the package.
+
+# Specific comments
+
+## Abstract
+
+I don't think "the NCBI taxonomic backbone" is an established term and, as such, should not be used in the abstract. When googling, at least the top three hits are to the authors' own papers and a preprint of the present manuscript.
+
+> Corrected as part of the abstract changes
+
+I have been programming my whole life and I struggle with the following sentence: "The basic search functions are coupled with quality-of-life functions including case-insensitive search and custom fuzzy string matching to facilitate the amount of information that can be extracted automatically while allowing efficient manual curation and inspection of results." In particular, "quality-of-life functions" and "custom fuzzy string matching" is not helpful for anyone curious about your work.
+
+> Corrected as part of the abstract changes. We now provide a longer list of
+> functionalities.
+
+Is relying on the Apache Arrow format a dependency or a feature? If it is an implementation detail, I would say it does not belong in the abstract.
+
+> Both - we have kept it in the abstract as it allows high-performance access to
+> the data.
+
+The abstract does not speak to a broader public. What are the applications of
+the software? The phrase "to facilitate the reconciliation and cleaning of
+taxonomic names" probably only makes sense to a quite narrow audience.
+
+> We have clarified the list of common issues in taxonomic names that the
+> software is intended to correct.
+
+## Background
+
+The first paragraph reads as a copy of the abstract.
+
+> The first paragraph has been reworked.
+
+Please define "the NCBI taxonomic backbone" before use.
+
+> We define taxonomic backbones in the first paragraph.
+
+"Unambiguously identifying species" should be "Unambiguously identifying species names in text".
+
+> Thank you for the suggestion, fixed.
+
+Avoid "presented below". Write "presented in Table 1" instead. You cannot assume the table in print ends up where you expect it.
+
+> Fixed.
+
+I note that Table 1 has a column "Reference", which is good, but it is empty.
+
+> TODO
+
+## Language
+
+Opening your submitted file in Word, I get spell and grammar warnings on quite trivial mistakes, for example "occuring", "litterature", "to the point were", and more.
+
+> These have been fixed
+
+I also note simple mistakes that are hard for Word to notice: "a string of character"
+
+> These have been fixed
+
+## Code and code access
+
+I am not a Julia user, but from a general (programming language agnostic) standpoint it looks like well-structured code.
+
+> Thank you.
+
+The Zenodo page is either not existing or it is not accessible to the public.
+
+> TODO
+
+The GitHub repository is accessible. It is also set up for and invites collaboration. There are no instructions for how to install and get started, from what I can find. How much of a Julia user does one need to be to try this package out? It would be nice with some basic install and get-started instructions. That is extra work I do not want to demand, but it would certainly help with "pickup" of users. For example, I have text I would be curious to test your package on, so what would I do?
+
+> TODO
+
+It does not strike me as important to have details about error handling in the article. It is good programming and it should be boasted as a feature, but such programming details belongs in the package documentation (or README if you want to make it more public), in my humble opinion.
+
+> TODO
+
+# Reviewer 2
+
+Poisot and colleagues present their software package aimed at making taxonomic classifications/searches more efficient on a local copy of the NCBI database. This appears to be a useful tool that the bioinformatics community will appreciate.
+
+## Minor editorial comments:
+
+P3: improve what? Maybe ‘improve classifications such as conservation outcomes’.
+
+> Clarified as part of changes to the background section.
+
+P4: should be ‘string of characters’.
+
+> Fixed.
+
+P4: literature is misspelled.
+
+> Fixed
+
+P4: there are ‘(ii)’s, one of them should be ‘(iii)’.
+
+> Fixed
+
+P5: what is a ‘raxonomi’?
+
+> Fixed
+
+P6: ‘the the’ is incorrect.
+
+> Fixed
+
+P7: ‘possible’ should be ‘possibly’.
+
+> Fixed
+
+P7: ‘table has’ should be ‘table currently has’.
+
+> Not fixed, "at the time of writing" is specified immediately after in the sentence
+
+P7: omit ‘at the time of writing’.
+
+> Not fixed
+
+P7: replace ‘search faster’ with ‘searches faster’.
+
+> Fixed
+
+P7: replace ‘search are’ with ‘searches are’
+
+> Fixed
