Skip to content

OUP parser for references extracted from fulltext#37

Merged
ehenneken merged 2 commits intoadsabs:masterfrom
ehenneken:OUPFTxml_parser
Mar 19, 2026
Merged

OUP parser for references extracted from fulltext#37
ehenneken merged 2 commits intoadsabs:masterfrom
ehenneken:OUPFTxml_parser

Conversation

@ehenneken
Copy link
Copy Markdown
Member

  • OUP parser for references extracted from fulltext
  • unittests for testing of this parser

Copy link
Copy Markdown
Contributor

@Thomas-S-Allen Thomas-S-Allen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary issue below is number 1. I think that one needs to be fixed. The rest depend on the answer to the question or are suggestions.

  1. Lines 121-127 are a repeat/overwrite of lines 108-115 with some missing functionality. I'm guessing the block from 108-115 is preferred.

  2. For the logging on line 313, do you want it to show the length of the outer list references, or for the parsed references from the current bibcode block parsed_references? Currently it is logging the former.

  3. Can journal ever be None, or do the cases in lines 74-92 catch all cases? If it can be None, then line 110 will break.

  4. Similarly, for volume, can xmlnode_nodecontents('volume') be None? If so, line 95 can break. A possible fix:
    if not volume: volume_node = self.xmlnode_nodecontents('volume') volume = volume_node.lower().replace('vol', '').strip() if volume_node else ''

  5. Is it possible for 'author' on line 165 to contain XML? Is that desired or do you want it to be plaintext by that point?

  6. The type variable name on line 100 is the same name as the built-in python function type()

  7. Also, on line 100, do you want to include the <mixed-citation publication-type> tag?

  8. On line 86, will volume always be in group(1)? If it is possible it's in group(2), could produce None Could use : volume = match.group(1) or match.group(2) or '' instead.

  9. The import on line 322 will always run, even when not testing. Can move to inside the if __name__ = '__main__': block.

  10. Line 259 could use super().__init__(...) instead of XMLtoREFs.__init__(...)

  11. Line 69, does the XML always use bookTitle or could it sometimes use book-title?

  12. Regex:

Line 38: Do you want to the regex to match __amp;? or amp;? ?

Line 40: I imagine there should only be one <etal> </etal> tag per reference. If there are more though, the regex is greedy and will capture everything between the first <etal> and the last </etal>

Line 110: do you want to replace all occurrences of amp in a journal name like the (contrived) "Journal of Campsite Astronomy & Astrophysics". (That would be a fun journal!)

Unit tests:

  1. Since the unit test on line 1250 is running the whole process not just __init__ the name could be something like test_process_and_dispatch.

  2. On line 1268 do you want buffer={} or should it match line 1253 where buffer=None is used?

Copy link
Copy Markdown
Contributor

@Thomas-S-Allen Thomas-S-Allen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@ehenneken ehenneken merged commit dea4fae into adsabs:master Mar 19, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants