tpoisot · tpoisot · commit 983c0b7f1355 · 2025-06-30T14:10:31.000-04:00
diff --git a/manuscript.typ b/manuscript.typ
@@ -1,4 +1,4 @@
-#set text(size: 12pt, font: "Inter")
+#set text(size: 11pt, font: "Inter")
 
 #text(weight: "regular", size: 25pt)[NCBITaxonomy.jl - rapid biological names finding and reconciliation]
 
@@ -153,39 +153,6 @@ of this feature on the performance of the package is presented in Table 1.
   kind: table
 ): set figure.caption(position: top)
 
-#figure(
-  placement: bottom,
-table(
-  columns: 10,
-  table.header(
-    [Tool],
-    [Lang.],
-    [Library],
-    [CLI],
-    [Local DB],
-    [Fuzzy],
-    [Case],
-    [Subsets],
-    [Ranks],
-    [Reference],
-  ),
-  [`NCBITaxonomy.jl`], [Julia], [#sym.checkmark], [], [#sym.checkmark], [#sym.checkmark], [#sym.checkmark], [#sym.checkmark], [#sym.checkmark], [This paper],
-  [`taxadb`], [R], [#sym.checkmark], [], [#sym.checkmark], [], [], [#sym.checkmark], [#sym.checkmark], [This paper],
-  [`taxopy`], [Python], [#sym.checkmark], [], [#sym.checkmark], [], [#sym.checkmark], [], [], [This paper],
-  [`rentrez`], [R], [#sym.checkmark], [], [], [], [], [], [#sym.checkmark], [This paper],
-  [`TaxonKit`], [Python], [], [#sym.checkmark], [#sym.checkmark], [], [], [], [], [This paper],
-  [`NCBI-taxonomist`], [Python], [], [#sym.checkmark], [#sym.checkmark], [], [], [], [], [This paper],
-),
-caption: [Comparison of core features of packages offering access to the NCBI
-taxonomic backbone. "Library": ability to be called from code. "CLI": ability to
-work as a command-line tool. "Local DB": ability to store a copy of the database
-locally. "Fuzzy": ability to perform fuzzy matching on inputs. "Case": ability
-to perform case-insensitive search. "Subsets": ability to limit the search to a
-subset of the raw database. "Ranks": ability to limit the search to specific
-taxonomic ranks. The features of the various packages have been determined from
-reading their documentation.]
-) <comparison>
-
 An up-to-date version of the documentation for `NCBITaxonomy.jl` can be found in
 the package's _GitHub_ repository (#link("https://github.com/PoisotLab/NCBITaxonomy.jl")[`PoisotLab/NCBITaxonomy.jl`]), including
 examples and in-line documentation of every method. The package is released
@@ -326,30 +293,6 @@ hemipteran genus _Lisarda_ rather than the class _Lepidosauria_).
 Note that the use of a restricted list of names can have significant performance
 consequences. This is illustrated in @benchmark[Tab.]. When possible, the optimal search strategy is to (i) rely on name filters to ensure that searches are conducted within the appropriate NCBI division, and (ii) only rely on fuzzy matching when the strict or lowercase match fails to return a name, as fuzzy matching can increase run time and memory footprint by an order of magnitude or more.
 
-
-#figure(
-  placement: bottom,
-table(
-  columns: 5,
-  table.header(
-    [Names list],
-    [Fuzzy matching],
-    [Time (ms)],
-    [Allocations],
-    [Memory footprint],
-  ),
-  [all], [no], [23], [34], [2 KiB],
-  [], [yes], [105], [2580], [25 MiB],
-  [`mammalfilter(true)`], [no], [0.55], [32], [2 KiB],
-  [], [yes], [1.9], [551], [286 KiB],
-  [`primatefilter(true)`], [no], [0.15], [33], [2 KiB],
-  [], [yes], [0.3], [92], [27 KiB],
-),
-caption: [Time and performance of different search strategies for the string `"chimpanzees"`. These numbers were obtained on a single Intel i7-8665U CPU (1.90GHz). Using `"Pan"` as the search string (for which `"chimpanzees"`is a recognized vernacular) gave qualitatively similar results, suggesting
-that there is no performance cost associated with working with synonyms or
-verncular input data.]
-) <benchmark>
-
 == Quality of life functions
 
 In order to facilitate working with names, we provide the `authority` function
@@ -400,4 +343,67 @@ distributed systems was enabled by support provided by Calcul Québec
 the initial code, TP and CJC contributed to API design, and all authors
 contributed to functionalities and usability testing. 
 
-#bibliography("references.bib", style: "biomed-central")
+#pagebreak()
+
+#bibliography("references.bib", style: "biomed-central")
+
+#pagebreak()
+
+
+#figure(
+  placement: auto,
+table(
+  columns: 10,
+  table.header(
+    [Tool],
+    [Lang.],
+    [Library],
+    [CLI],
+    [Local DB],
+    [Fuzzy],
+    [Case],
+    [Subsets],
+    [Ranks],
+    [Reference],
+  ),
+  [`NCBITaxonomy.jl`], [Julia], [#sym.checkmark], [], [#sym.checkmark], [#sym.checkmark], [#sym.checkmark], [#sym.checkmark], [#sym.checkmark], [This paper],
+  [`taxadb`], [R], [#sym.checkmark], [], [#sym.checkmark], [], [], [#sym.checkmark], [#sym.checkmark], [@Norman2020TaxHig],
+  [`taxopy`], [Python], [#sym.checkmark], [], [#sym.checkmark], [], [#sym.checkmark], [], [], [@TAXOPY],
+  [`rentrez`], [R], [#sym.checkmark], [], [], [], [], [], [#sym.checkmark], [@RENTREZ],
+  [`TaxonKit`], [Python], [], [#sym.checkmark], [#sym.checkmark], [], [], [], [], [@TAXONKIT],
+  [`NCBI-taxonomist`], [Python], [], [#sym.checkmark], [#sym.checkmark], [], [], [], [], [@NCBITAXONOMIST],
+),
+caption: [Comparison of core features of packages offering access to the NCBI
+taxonomic backbone. "Library": ability to be called from code. "CLI": ability to
+work as a command-line tool. "Local DB": ability to store a copy of the database
+locally. "Fuzzy": ability to perform fuzzy matching on inputs. "Case": ability
+to perform case-insensitive search. "Subsets": ability to limit the search to a
+subset of the raw database. "Ranks": ability to limit the search to specific
+taxonomic ranks. The features of the various packages have been determined from
+reading their documentation.]
+) <comparison>
+
+#pagebreak()
+
+#figure(
+  placement: auto,
+table(
+  columns: 5,
+  table.header(
+    [Names list],
+    [Fuzzy matching],
+    [Time (ms)],
+    [Allocations],
+    [Memory footprint],
+  ),
+  [all], [no], [23], [34], [2 KiB],
+  [], [yes], [105], [2580], [25 MiB],
+  [`mammalfilter(true)`], [no], [0.55], [32], [2 KiB],
+  [], [yes], [1.9], [551], [286 KiB],
+  [`primatefilter(true)`], [no], [0.15], [33], [2 KiB],
+  [], [yes], [0.3], [92], [27 KiB],
+),
+caption: [Run time and memory use of different search strategies for the string `"chimpanzees"`. These numbers were obtained on a single Intel i7-8665U CPU (1.90GHz). Using `"Pan"` as the search string (for which `"chimpanzees"` is a recognized vernacular) gave qualitatively similar results, suggesting
+that there is no performance cost associated with working with synonyms or
+vernacular input data.]
+) <benchmark>
diff --git a/references.bib b/references.bib
@@ -191,3 +191,58 @@ @article{Walker2020ChaVir
 }
 
 
+@software{TAXOPY,
+  author       = {Antônio Camargo and
+                  Michael Kuhn and
+                  Moritz E. Beber and
+                  Maxime Borry},
+  title        = {apcamargo/taxopy: v0.14.0},
+  month        = feb,
+  year         = 2025,
+  publisher    = {Zenodo},
+  version      = {v0.14.0},
+  doi          = {10.5281/zenodo.14799274},
+  url          = {https://doi.org/10.5281/zenodo.14799274},
+  swhid        = {swh:1:dir:f177879a052c9879462c9212dfef7e6ee03df88e
+                   ;origin=https://doi.org/10.5281/zenodo.6993580;vis
+                   it=swh:1:snp:6bc7116a48d9ddb693e8e5aa9f8176255fbcc
+                   cb9;anchor=swh:1:rel:d9337823a00e9c6c60c8372f3d692
+                   b095c366361;path=apcamargo-taxopy-8236fa3
+                  },
+}
+
+@software{RENTREZ,
+  author       = {Winter, David and
+                  Chamberlain, Scott and
+                  Guangchun, Han},
+  title        = {Rentrez 1.0.0},
+  month        = sep,
+  year         = 2015,
+  publisher    = {Zenodo},
+  doi          = {10.5281/zenodo.32420},
+  url          = {https://doi.org/10.5281/zenodo.32420},
+}
+
+@ARTICLE{TAXONKIT,
+  title        = {{TaxonKit}: A practical and efficient {NCBI} taxonomy toolkit},
+  author       = {Shen, Wei and Ren, Hong},
+  journaltitle = {Yi chuan xue bao [Journal of genetics and genomics]},
+  publisher    = {Elsevier BV},
+  volume       = {48},
+  issue        = {9},
+  pages        = {844--850},
+  date         = {2021-09-20},
+  doi          = {10.1016/j.jgg.2021.03.006}
+}
+
+@ARTICLE{NCBITAXONOMIST,
+  title        = {Collecting and managing taxonomic data with {NCBI}-taxonomist},
+  author       = {Buchmann, Jan P and Holmes, Edward C},
+  journaltitle = {Bioinformatics (Oxford, England)},
+  publisher    = {Oxford University Press (OUP)},
+  volume       = {36},
+  issue        = {22-23},
+  pages        = {5548--5550},
+  date         = {2021-04-01},
+  doi          = {10.1093/bioinformatics/btaa1027}
+}
diff --git a/reviews.typ b/reviews.typ
@@ -0,0 +1,157 @@
+#let response(body) = {
+  block(inset: (left: 1cm))[
+    #text(blue, body)
+  ]
+}
+
+= Reviewer 1
+
+This paper describes a Julia package for identifying and standardizing species
+names in text, with the purpose of ensuring that species are not over-counted,
+mis-identified, or misunderstood. While I have not been able to check the
+software, I note that the code and its repository looks nicely engineered and
+that it is in use in more than one system setup by the authors. While not a new
+concept, the novelty with the software is its combination of features.
+
+#response[We appreciate the reviewer's feedback, and have addressed all of their comments in the revision.]
+
+My general impression of the paper is that it describes useful software but that
+the text needs work. It seems the abstract and background has been given far
+less attention than the rest of the paper. I am confident that more readers will
+be found if those parts are reworked.
+
+#response[We have updated the abstract, and clarified the background section of the
+manuscript, notably in the last paragraph. We hope that this will help readers
+understand the purpose of the package.
+]
+
+== Specific comments
+
+=== Abstract
+
+I don't think "the NCBI taxonomic backbone" is an established term and, as such, should not be used in the abstract. When googling, at least the top three hits are to the authors' own papers and a preprint of the present manuscript.
+
+#response[Corrected as part of the abstract changes]
+
+I have been programming my whole life and I struggle with the following sentence: "The basic search functions are coupled with quality-of-life functions including case-insensitive search and custom fuzzy string matching to facilitate the amount of information that can be extracted automatically while allowing efficient manual curation and inspection of results." In particular, "quality-of-life functions" and "custom fuzzy string matching" is not helpful for anyone curious about your work.
+
+#response[
+Corrected as part of the abstract changes. We now provide a longer list of
+functionalities.
+]
+
+Is relying on the Apache Arrow format a dependency or a feature? If it is an implementation detail, I would say it does not belong in the abstract.
+
+#response[
+Both - we have kept it in the abstract as it allows high-performance access to
+the data.
+]
+
+The abstract does not speak to a broader public. What are the applications of
+the software? The phrase "to facilitate the reconciliation and cleaning of
+taxonomic names" probably only makes sense to a quite narrow audience.
+
+#response[
+We have clarified the list of common issues in taxonomic names that the
+software is intended to correct.
+]
+
+=== Background
+
+The first paragraph reads as a copy of the abstract.
+
+#response[The first paragraph has been reworked.]
+
+Please define "the NCBI taxonomic backbone" before use.
+
+#response[We define taxonomic backbones in the first paragraph.]
+
+"Unambiguously identifying species" should be "Unambiguously identifying species names in text".
+
+#response[Thank you for the suggestion, fixed.]
+
+Avoid "presented below". Write "presented in Table 1" instead. You cannot assume the table in print ends up where you expect it.
+
+#response[Fixed.]
+
+I note that Table 1 has a column "Reference", which is good, but it is empty.
+
+#response[Our apologies for the omission; it has been filled in.]
+
+=== Language
+
+Opening your submitted file in Word, I get spell and grammar warnings on quite trivial mistakes, for example "occuring", "litterature", "to the point were", and more.
+
+#response[These have been fixed]
+
+I also note simple mistakes that are hard for Word to notice: "a string of character"
+
+#response[These have been fixed]
+
+=== Code and code access
+
+I am not a Julia user, but from a general (programming language agnostic) standpoint it looks like well-structured code.
+
+#response[Thank you.]
+
+The Zenodo page is either not existing or it is not accessible to the public.
+
+#response[There was an issue with the link; it has been fixed in the revision.]
+
+The GitHub repository is acessible. It is also setup for and invites for collaboration. There are no instructions for how to install and get started, from what I can find. How much of a Julia user does one need to try this package out? It would be nice with some basic install and get-started instructions. That is extra work I do not want to demand, but it would certainly help with "pickup" of users. For example, I have text would be curious to test your package on, so what would I do?
+
+#response[Installation instructions have been added to the README, and more detailed "getting started" instructions are in the documentation.]
+
+It does not strike me as important to have details about error handling in the article. It is good programming and it should be boasted as a feature, but such programming details belongs in the package documentation (or README if you want to make it more public), in my humble opinion.
+
+#response[We feel strongly that keeping this code snippet in the text is important, as it will help users adopt it as a basis for building pipelines that use the error-catching system.]
+
+= Reviewer 2
+
+Poisot and colleagues present their software package aimed at making taxonomic classifications/searches more efficient on a local copy of the NCBI database. This appears to be a useful tool that the bioinformatics community will appreciate.
+
+== Minor editorial comments:
+
+P3: improve what? Maybe ‘improve classifications such as conservation outcomes’.
+
+#response[Clarified as part of changes to the background section.]
+
+P4: should be ‘string of characters’.
+
+#response[Fixed.]
+
+P4: literature is misspelled.
+
+#response[Fixed]
+
+P4: there are ‘(ii)’s, one of them should be ‘(iii)’.
+
+#response[Fixed]
+
+P5: what is a ‘raxonomi’?
+
+#response[Fixed]
+
+P6: ‘the the’ is incorrect.
+
+#response[Fixed]
+
+P7: ‘possible’ should be ‘possibly’.
+
+#response[Fixed]
+
+P7: ‘table has’ should be ‘table currently has’.
+
+#response[Not fixed: "at the time of writing" is specified immediately afterwards in the same sentence.]
+
+P7: omit ‘at the time of writing’.
+
+#response[Not fixed]
+
+P7: replace ‘search faster’ with ‘searches faster’.
+
+#response[Fixed]
+
+P7: replace ‘search are’ with ‘searches are’
+
+#response[Fixed]