OUP parser for references extracted from fulltext #37
ehenneken merged 2 commits into adsabs:master
ehenneken commented Mar 16, 2026
- OUP parser for references extracted from fulltext
- unit tests for this parser
Thomas-S-Allen left a comment
The primary issue below is number 1. I think that one needs to be fixed. The rest depend on the answer to the question or are suggestions.
1. Lines 121-127 are a repeat/overwrite of lines 108-115 with some functionality missing. I'm guessing the block from 108-115 is the preferred one.
2. For the logging on line 313, do you want it to show the length of the outer list `references`, or of the parsed references from the current bibcode block, `parsed_references`? Currently it is logging the former.
3. Can `journal` ever be `None`, or do the cases in lines 74-92 catch all cases? If it can be `None`, then line 110 will break.
4. Similarly, for `volume`, can `xmlnode_nodecontents('volume')` be `None`? If so, line 95 can break. A possible fix:

       if not volume:
           volume_node = self.xmlnode_nodecontents('volume')
           volume = volume_node.lower().replace('vol', '').strip() if volume_node else ''

5. Is it possible for `author` on line 165 to contain XML? Is that desired, or do you want it to be plaintext by that point?
6. The `type` variable name on line 100 shadows the built-in Python function `type()`.
7. Also, on line 100, do you want to include the `<mixed-citation publication-type>` tag?
8. On line 86, will `volume` always be in `group(1)`? If it is possible that it's in `group(2)`, then `group(1)` could produce `None`. Could use `volume = match.group(1) or match.group(2) or ''` instead.
9. The import on line 322 will always run, even when not testing. It can move inside the `if __name__ == '__main__':` block.
10. Line 259 could use `super().__init__(...)` instead of `XMLtoREFs.__init__(...)`.
11. Line 69: does the XML always use `bookTitle`, or could it sometimes use `book-title`?
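The `None` questions above about `journal`, `volume` and `group(1)` all come down to the same guard. A minimal sketch of what I mean, assuming a made-up pattern with two alternatives (the pattern, `clean_volume` and `volume_from_text` are hypothetical names for illustration, not the parser's actual code):

```python
import re

# Hypothetical pattern: two alternatives, so only one capture group can match.
VOLUME_RE = re.compile(r"vol\.?\s*(\d+)|v(\d+)", re.IGNORECASE)


def clean_volume(raw):
    """Return a normalized volume string, or '' when the input is None or empty."""
    if not raw:
        return ""
    return raw.lower().replace("vol", "").strip()


def volume_from_text(text):
    """Return the first volume-like number found in text, or '' if there is none."""
    match = VOLUME_RE.search(text or "")
    if match is None:
        return ""
    # With alternation, the group of the branch that did not match is None.
    return match.group(1) or match.group(2) or ""
```

With this shape, a missing node or an unmatched group ends up as an empty string rather than raising `AttributeError` on `None.lower()` further down.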
Regex:
- Line 38: Do you want the regex to match `__amp;?` or `amp;?`?
- Line 40: I imagine there should only be one `<etal>` `</etal>` tag pair per reference. If there are more, though, the regex is greedy and will capture everything between the first `<etal>` and the last `</etal>`.
- Line 110: Do you want to replace all occurrences of `amp` in a journal name like the (contrived) "Journal of Campsite Astronomy & Astrophysics"? (That would be a fun journal!)
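To make the two regex concerns concrete, here is a small sketch. The sample strings and both patterns are my own assumptions, not copied from the parser; the point is only the difference between `.*` and `.*?`, and between replacing the bare substring `amp` and replacing the `&amp;` entity.

```python
import re

# Assumed sample reference with two <etal> elements (contrived).
reference = "Smith J. <etal>et al.</etal> 2001, ApJ <etal>et al.</etal> 550"

GREEDY_ETAL = re.compile(r"<etal>.*</etal>")   # spans first <etal> to last </etal>
LAZY_ETAL = re.compile(r"<etal>.*?</etal>")    # stops at the nearest </etal>

greedy_result = GREEDY_ETAL.sub("", reference)  # also drops "2001, ApJ"
lazy_result = LAZY_ETAL.sub("", reference)      # keeps the text between the tags

# Assumed journal string containing both the entity and the substring "amp".
journal = "Journal of Campsite Astronomy &amp; Astrophysics"

naive = journal.replace("amp", "")          # also damages "Campsite"
targeted = journal.replace("&amp;", "&")    # touches only the entity
```

If only one `<etal>` can ever occur, the greedy form is harmless, but the lazy form costs nothing and is safer.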
Unit tests:
- Since the unit test on line 1250 is running the whole process, not just `__init__`, the name could be something like `test_process_and_dispatch`.
- On line 1268, do you want `buffer={}`, or should it match line 1253, where `buffer=None` is used?
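Whether `buffer={}` and `buffer=None` behave the same depends on how the constructor checks the argument, which I have not verified. A sketch of the two possibilities, using hypothetical functions rather than the real class:

```python
def uses_truthiness(buffer=None):
    """Treats None and {} alike, because both are falsy."""
    return "from file" if not buffer else "from buffer"


def uses_identity(buffer=None):
    """Only None selects the file path; {} counts as an (empty) buffer."""
    return "from file" if buffer is None else "from buffer"
```

If the code follows the second style, the two tests exercise different paths, so the choice should be deliberate.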