Fix SAX parser state leakage for PubmedBookArticle #106

Open
paulalbert1 wants to merge 2 commits into master from dev

@paulalbert1
Contributor

Summary

Fixes a bug where PubmedBookArticle elements in PubMed eFetch responses corrupt the previously-parsed PubmedArticle record's data (author list, abstract, etc.) via SAX parser state leakage.

Changes

  • Null out pubmedArticle when <PubmedBookArticle> is encountered, so book article fields can no longer overwrite the previous article's data
  • Removed a duplicate AuthorList handler block: identical code appeared twice in startElement

Root Cause

PubmedEFetchHandler.java uses a single pubmedArticle instance variable. When a <PubmedBookArticle> follows a <PubmedArticle> in the same batch, no new object is created — so the book article's AuthorList, Abstract, etc. overwrite the previous article via the still-live reference.
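The leak and the fix can be reproduced in a minimal sketch. This is not the code in PubmedEFetchHandler.java: the class names, the `nullOnBookArticle` toggle, the sample XML, and the choice to reset the author list when `<AuthorList>` opens are all assumptions made for illustration. Only the single instance-level `pubmedArticle` reference and the null guard mirror what this PR describes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BookArticleLeakSketch {

    /** Hypothetical stand-in for the real article model. */
    static class Article {
        final List<String> authors = new ArrayList<>();
    }

    /** Simplified handler; nullOnBookArticle switches the fix on or off. */
    static class Handler extends DefaultHandler {
        final List<Article> articles = new ArrayList<>();
        private final boolean nullOnBookArticle;
        private final StringBuilder chars = new StringBuilder();
        private Article pubmedArticle; // single instance-level reference

        Handler(boolean nullOnBookArticle) {
            this.nullOnBookArticle = nullOnBookArticle;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            chars.setLength(0);
            if (qName.equals("PubmedArticle")) {
                pubmedArticle = new Article();
                articles.add(pubmedArticle);
            } else if (qName.equals("PubmedBookArticle")) {
                // The fix: drop the stale reference so the guards below skip the book.
                if (nullOnBookArticle) {
                    pubmedArticle = null;
                }
            } else if (qName.equals("AuthorList") && pubmedArticle != null) {
                // Assumption: the author list is reset whenever an AuthorList opens.
                pubmedArticle.authors.clear();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            chars.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (pubmedArticle != null && qName.equals("LastName")) {
                pubmedArticle.authors.add(chars.toString());
            }
        }
    }

    /** Made-up batch: one regular article followed by one book article. */
    static final String SAMPLE =
        "<PubmedArticleSet>"
        + "<PubmedArticle><AuthorList>"
        + "<Author><LastName>Smith</LastName></Author>"
        + "</AuthorList></PubmedArticle>"
        + "<PubmedBookArticle><AuthorList>"
        + "<Author><LastName>Jones</LastName></Author>"
        + "<Author><LastName>Lee</LastName></Author>"
        + "</AuthorList></PubmedBookArticle>"
        + "</PubmedArticleSet>";

    static List<Article> parse(String xml, boolean nullOnBookArticle) throws Exception {
        Handler handler = new Handler(nullOnBookArticle);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.articles;
    }

    public static void main(String[] args) throws Exception {
        // Without the fix, the book's authors land on the preceding article.
        System.out.println("without fix: " + parse(SAMPLE, false).get(0).authors);
        // With the fix, the preceding article keeps its own author list.
        System.out.println("with fix:    " + parse(SAMPLE, true).get(0).authors);
    }
}
```

In this sketch the first call reports the book's authors (Jones, Lee) on the regular article, while the second leaves the original author (Smith) intact. The book article itself is skipped in both cases, which matches the short-term scope of this change.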

Impact

  • Example: PMID 38461731 (8-author MS study) had its author list silently replaced by 79 authors from PMID 41525446 (a PCORI book report)
  • Affects any PubmedArticle that precedes a PubmedBookArticle in an eFetch batch
  • Corrupted data persists in DynamoDB until articles are re-retrieved

Verified on dev

Merged to dev via PR #105 and deployed.

After merge to master

Re-retrieve articles for affected users with refreshFlag=ALL_PUBLICATIONS to replace corrupted DynamoDB records.

Long-term

Proper PubmedBookArticle parsing is planned as a separate effort (see MULTI_SOURCE_ARCHITECTURE.md in the scoring research repo).

paulalbert1 and others added 2 commits March 9, 2026 13:56
…icle

The SAX parser uses a single instance-level `pubmedArticle` variable.
When PubMed returns a batch containing both <PubmedArticle> and
<PubmedBookArticle> elements, there was no handler for
<PubmedBookArticle>. When a book article followed a regular article,
`pubmedArticle` still referenced the previous article, causing the
book article's AuthorList, Abstract, and other fields to overwrite
the previous article's data. This corrupted data would then be saved
to DynamoDB.

For example, PMID 38461731 had its 8-author list replaced by 79
authors from book article PMID 41525446.

Fix: null out `pubmedArticle` when <PubmedBookArticle> is encountered,
so all subsequent element processing is skipped by the existing
`if (pubmedArticle != null)` guard.

Also remove a duplicate AuthorList handler block in startElement that
re-initialized the author list unnecessarily.

This is a short-term fix; the long-term plan is to properly parse and
support PubmedBookArticle types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix SAX parser state leakage when PubmedBookArticle follows PubmedArticle