
Fix #24 and #14: Dropped query detection and inline markup preservation #103

Open
paulalbert1 wants to merge 3 commits into master from fix/esearch-dropped-query-and-mathml


@paulalbert1
Contributor

Summary

  • Fix "Do not use results when PubMed has zero results" (#24): PubMed silently drops unrecognized name parts (e.g., Charles-rawlins J[au] becomes just J[au]), returning hundreds of irrelevant results. The fix checks the eSearch errorlist.phrasenotfound response field: if PubMed dropped phrases and the remaining QueryTranslation is trivially short (just an initial like J), it returns 0 results. Applied to both code paths: the count endpoint (controller) and the internal retrieve flow (service).

  • Fix "Account for LaTeX-based symbols" (#14): The SAX parser was clearing its character buffer on every startElement(), which discarded preceding text whenever inline markup appeared inside <ArticleTitle> or <AbstractText>. This affected MathML (<mml:math>), HTML formatting (<i>, <b>, <sup>, <sub>), and any other nested elements. For example, "Role of <i>BRCA1</i> in cancer" was truncated to "BRCA1 in cancer". The buffer is now cleared only when the parser is not inside an active title/abstract accumulation context.
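The buffer handling described for #14 can be sketched with a small standalone SAX handler. This is only an illustration under stated assumptions: the class InlineMarkupTitleHandler, its accumulating flag, and the parseTitle helper are hypothetical names, and the real PubmedEFetchHandler is not shown in this PR description and may be structured differently.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not the actual PubmedEFetchHandler.
final class InlineMarkupTitleHandler extends DefaultHandler {
    private final StringBuilder chars = new StringBuilder();
    private boolean accumulating = false; // true while inside ArticleTitle
    String title;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Clear the buffer only when we are NOT collecting title text,
        // so nested tags such as <i> or <sup> do not discard earlier text.
        if (!accumulating) {
            chars.setLength(0);
        }
        if ("ArticleTitle".equals(qName)) {
            accumulating = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        chars.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("ArticleTitle".equals(qName)) {
            title = chars.toString();
            accumulating = false;
        }
    }

    // Convenience helper for the example: parse an XML string and return the title.
    static String parseTitle(String xml) {
        try {
            InlineMarkupTitleHandler handler = new InlineMarkupTitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.title;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

With this sketch, parsing an ArticleTitle that contains "Role of <i>BRCA1</i> in cancer" yields "Role of BRCA1 in cancer". If the buffer were cleared unconditionally in startElement, the text before <i> would be lost. The same pattern would presumably extend to AbstractText.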

Files changed

| File | Issue | Change |
| --- | --- | --- |
| PubMedRetrievalToolController.java | #24 | Added phrasenotfound check before existing isValidAuthorString validation |
| PubMedArticleRetrievalService.java | #24 | Added isPubMedQueryDropped() method + check in getNumberOfPubMedArticles() |
| PubmedEFetchHandler.java | #14 | Conditional chars.setLength(0): skip when inside ArticleTitle or AbstractText |
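The phrasenotfound check listed for #24 might look roughly like the sketch below. It assumes the eSearch JSON has already been parsed elsewhere into a list of dropped phrases and a QueryTranslation string. The class DroppedQueryCheck, the method looksDropped, and the one-letter threshold are illustrative assumptions; the PR's actual isPubMedQueryDropped() signature and threshold are not shown here.

```java
import java.util.List;

// Hypothetical helper; the real isPubMedQueryDropped() may differ.
final class DroppedQueryCheck {
    // Assumed threshold: a single remaining letter counts as "just an initial".
    private static final int MAX_TRIVIAL_LETTERS = 1;

    /**
     * @param phrasesNotFound  values of errorlist.phrasenotfound (may be null or empty)
     * @param queryTranslation the QueryTranslation string PubMed actually searched
     * @return true if PubMed dropped phrases and what remains is trivially short
     */
    static boolean looksDropped(List<String> phrasesNotFound, String queryTranslation) {
        if (phrasesNotFound == null || phrasesNotFound.isEmpty()) {
            return false; // nothing was dropped, so trust the result count
        }
        if (queryTranslation == null) {
            return true; // phrases were dropped and nothing usable remains
        }
        // Remove field tags such as [au] or [Author], then keep letters only.
        String remaining = queryTranslation
                .replaceAll("\\[[^\\]]*\\]", "")
                .replaceAll("[^\\p{L}]", "");
        return remaining.length() <= MAX_TRIVIAL_LETTERS;
    }
}
```

A caller would return 0 results when looksDropped(...) is true and otherwise fall through to the existing validation. Checking both conditions (phrases dropped and a trivially short remainder) avoids rejecting queries where PubMed dropped a minor token but still searched on a meaningful name.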

Test plan

  • Existing parser tests pass (5/5 ✅ verified locally)
  • Query for a person with hyphenated/unusual surname that PubMed drops (e.g., Charles-rawlins J[au]) → should return 0 results
  • Query for a normal name (e.g., Kukafka R[au]) → should still return results normally
  • Fetch an article with MathML in its title (e.g., PMID 31031568) → title should contain the full text including math symbols
  • Fetch an article with <i> formatting in its title → full title preserved

🤖 Generated with Claude Code

paulalbert1 and others added 3 commits February 20, 2026 20:28
PubMed "helpfully" drops unrecognized name parts (e.g., "Charles-rawlins
J[au]" becomes just "J[au]"), returning hundreds of irrelevant results.

Check the eSearch response for errorlist.phrasenotfound entries. If PubMed
dropped phrases and the remaining QueryTranslation is trivially short
(just an initial), return 0 results instead of importing the noise.

Applied to both code paths: the count endpoint (controller) and the
internal retrieve flow (service).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The SAX parser was clearing the character buffer on every startElement,
which discarded preceding text whenever inline markup appeared inside
ArticleTitle or AbstractText. This affected MathML (<mml:math>),
HTML formatting (<i>, <b>, <sup>, <sub>), and any other nested elements.

For example, a title like:
  "Role of <i>BRCA1</i> in cancer"
would be truncated to just "BRCA1 in cancer".

Now the buffer is only cleared when NOT inside an active title/abstract
accumulation context, preserving the full text content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move full XML/JSON response logging from INFO to DEBUG level in
PubMedRetrievalToolController. In production, large eSearch responses
(up to 15MB) were being logged at INFO on every request, creating
large String objects that contribute to heap pressure and OOMKill.

Verbose logging can be re-enabled at runtime via env var:
  LOGGING_LEVEL_RECITER_CONTROLLER=DEBUG

Also fixes three SLF4J placeholder bugs where log statements used
colon syntax instead of {} placeholders, causing arguments to be
silently ignored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@paulalbert1
Contributor Author

Added: Reduce memory pressure from verbose PubMed response logging

Context: The reciter-pubmed-prod pods are OOMKilled roughly once per day (16 restarts in 18 days, 5Gi limit). Investigation found that full PubMed eSearch responses (up to 15MB per response) were being logged at INFO level on every request, creating large String objects that contribute to heap pressure.

Changes in commit e153848:

  1. Verbose response logging → DEBUG level (hidden by default):

    • PubMedRetrievalToolController.java lines 160-162: full XML response log.info(writer.toString()) → log.debug()
    • Line 168: full JSON esearchresult → log.debug()
    • Line 188: query translation → log.debug()
  2. Fixed 3 SLF4J placeholder bugs — these log statements used colon syntax ("message:",arg) instead of {} placeholders, so the arguments were silently ignored:

    • log.info("PubMed Response Json:",json) → log.debug("PubMed Response Json: {}", json)
    • log.info("query translation prior to processing:",queryTranslation) → log.debug(...)
    • log.info("eSearchResult count for the firstNameInitial strategy :",eSearchResult.getCount()) → log.info("...: {}", ...)
  3. Added logback-spring.xml with standard console appender configuration.

  4. Added logging.level.reciter.controller=INFO to application.properties as the default. Operators can re-enable verbose logging at runtime via:

    LOGGING_LEVEL_RECITER_CONTROLLER=DEBUG
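
The placeholder bugs from item 2 can be illustrated without pulling in SLF4J itself. The format method below is a deliberately simplified, hypothetical stand-in for placeholder substitution (each {} consumes one argument in order); it is not SLF4J's implementation and skips details such as escaping and trailing Throwable handling. It shows only why an argument passed alongside a message with no {} never reaches the output.

```java
// Hypothetical stand-in for illustration; not the SLF4J formatter.
final class PlaceholderDemo {
    /** Replaces each "{}" in the pattern with the next argument, in order. */
    static String format(String pattern, Object... args) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        int argIndex = 0;
        while (argIndex < args.length) {
            int at = pattern.indexOf("{}", from);
            if (at < 0) {
                break; // no placeholder left: remaining arguments are dropped
            }
            out.append(pattern, from, at).append(args[argIndex++]);
            from = at + 2;
        }
        out.append(pattern.substring(from));
        return out.toString();
    }
}
```

Under this model, format("PubMed Response Json:", json) returns the message unchanged and the json argument is lost, while format("PubMed Response Json: {}", json) appends it as intended.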
    

Note: This code change alone won't fully resolve the OOM issue — the prod EKS deployment also needs JVM tuning (separate from this PR, see below).


Development

Successfully merging this pull request may close these issues.

  • Do not use results when PubMed has zero results
  • Account for LaTeX-based symbols
