OUP parser for references extracted from fulltext by ehenneken · Pull Request #37 · adsabs/ADSReferencePipeline

ehenneken · 2026-03-16T19:10:52Z

OUP parser for references extracted from fulltext
unittests for testing of this parser

Thomas-S-Allen

The primary issue below is number 1. I think that one needs to be fixed. The rest depend on the answer to the question or are suggestions.

Lines 121-127 are a repeat/overwrite of lines 108-115 with some missing functionality. I'm guessing the block from 108-115 is preferred.
For the logging on line 313, do you want it to show the length of the outer list references, or the length of parsed_references (the parsed references from the current bibcode block)? Currently it is logging the former.
Can journal ever be None, or do the cases in lines 74-92 catch all cases? If it can be None, then line 110 will break.
Similarly, for volume, can xmlnode_nodecontents('volume') be None? If so, line 95 can break. A possible fix:
    if not volume:
        volume_node = self.xmlnode_nodecontents('volume')
        volume = volume_node.lower().replace('vol', '').strip() if volume_node else ''
Is it possible for 'author' on line 165 to contain XML? Is that desired or do you want it to be plaintext by that point?
The variable name type on line 100 shadows the built-in Python function type().
Also, on line 100, do you want to include the <mixed-citation publication-type> tag?
On line 86, will volume always be in group(1)? If it can be in group(2) instead, group(1) could be None. You could use volume = match.group(1) or match.group(2) or '' instead.
The import on line 322 will always run, even when not testing. It can be moved inside the if __name__ == '__main__': block.
Line 259 could use super().__init__(...) instead of XMLtoREFs.__init__(...)
Line 69, does the XML always use bookTitle or could it sometimes use book-title?
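The None-related suggestions above (the volume node guard and the match.group fallback) can be sketched in isolation. This is a minimal illustration, not the parser's code: the pattern VOLUME_RE, the helper names extract_volume and clean_volume, and the sample strings are all hypothetical, and the actual regex on line 86 may look different.

```python
import re

# Hypothetical pattern with two alternative capture groups; only one of
# them takes part in any given match, so the other group is None.
VOLUME_RE = re.compile(r'vol\.?\s*(\d+)|\((\d+)\)', re.IGNORECASE)


def extract_volume(text):
    """Return the volume from whichever group matched, or '' if none did."""
    match = VOLUME_RE.search(text or '')
    if not match:
        return ''
    # Fall back from group(1) to group(2) to '' so the result is never None.
    return match.group(1) or match.group(2) or ''


def clean_volume(node_contents):
    """None-safe version of the lower/replace/strip chain on the volume node."""
    if not node_contents:
        return ''
    return node_contents.lower().replace('vol', '').strip()
```

With this shape, callers always receive a string, so later calls such as .lower() or .replace() cannot fail on None.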
Regex:

Line 38: Do you want the regex to match __amp;? or amp;? ?

Line 40: I imagine there should only be one <etal> </etal> tag per reference. If there are more though, the regex is greedy and will capture everything between the first <etal> and the last </etal>

Line 110: Do you want to replace all occurrences of amp in a journal name like the (contrived) "Journal of Campsite Astronomy & Astrophysics"? (That would be a fun journal!)
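To make the Line 40 and Line 110 points concrete, here is a small sketch of both behaviours. The patterns and function names below are assumptions written for illustration; they are not the regexes used in the parser, and the exact entity encoding the pipeline sees (for example __amp; versus &amp;) may differ.

```python
import re

# Greedy: .* runs from the first <etal> to the last </etal> in the string.
GREEDY_ETAL = re.compile(r'<etal>.*</etal>', re.DOTALL)
# Lazy: .*? stops at the nearest </etal>, so each tag pair is matched alone.
LAZY_ETAL = re.compile(r'<etal>.*?</etal>', re.DOTALL)
# Hypothetical entity pattern: matches the whole entity, not the bare letters.
AMP_ENTITY = re.compile(r'&amp;')


def strip_etal_greedy(reference):
    """Remove <etal> content with the greedy pattern."""
    return GREEDY_ETAL.sub('', reference)


def strip_etal_lazy(reference):
    """Remove <etal> content with the lazy pattern."""
    return LAZY_ETAL.sub('', reference)


def naive_amp(journal):
    """Plain substring replace: also hits 'amp' inside ordinary words."""
    return journal.replace('amp', '')


def anchored_amp(journal):
    """Replace only the complete entity with a literal ampersand."""
    return AMP_ENTITY.sub('&', journal)


reference = 'A <etal>x</etal> B <etal>y</etal> C'
journal = 'Journal of Campsite Astronomy &amp; Astrophysics'
```

On the sample reference, the greedy version also removes the " B " that sits between the two tag pairs, while the lazy version keeps it. On the sample journal, the plain replace turns "Campsite" into "Csite", while the anchored version leaves the word intact.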

Unit tests:

Since the unit test on line 1250 runs the whole process, not just __init__, the name could be something like test_process_and_dispatch.
On line 1268 do you want buffer={} or should it match line 1253 where buffer=None is used?

Thomas-S-Allen

Looks good!

OUP parser for references extracted from fulltext

2310e2e

ehenneken requested a review from Thomas-S-Allen March 16, 2026 19:11

Thomas-S-Allen requested changes Mar 18, 2026

implementation of PR comments

b99a7bc

Thomas-S-Allen approved these changes Mar 19, 2026

ehenneken merged commit dea4fae into adsabs:master Mar 19, 2026
1 check passed

ehenneken merged 2 commits into adsabs:master from ehenneken:OUPFTxml_parser

3 participants
