Fix #24 and #14: Dropped query detection and inline markup preservation by paulalbert1 · Pull Request #103 · wcmc-its/ReCiter-PubMed-Retrieval-Tool

paulalbert1 · 2026-02-21T01:29:57Z

Summary

Fix #24 (Do not use results when PubMed has zero results): PubMed silently drops unrecognized name parts (e.g., Charles-rawlins J[au] → J[au]) and returns hundreds of irrelevant results. The code now checks the eSearch errorlist.phrasenotfound response field: if PubMed dropped phrases and the remaining QueryTranslation is trivially short (just an initial like J), it returns 0 results. Applied to both code paths: the count endpoint (controller) and the internal retrieve flow (service).
Fix #14 (Account for LaTeX-based symbols): The SAX parser was clearing its character buffer on every startElement(), which discarded preceding text whenever inline markup appeared inside <ArticleTitle> or <AbstractText>. This affected MathML (<mml:math>), inline HTML formatting elements, and any other nested elements. For example, the title "Role of BRCA1 in cancer", with BRCA1 wrapped in an inline formatting element, was truncated to "BRCA1 in cancer". Now the buffer is cleared only when the parser is NOT inside an active title/abstract accumulation context.
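The buffer behavior described for #14 can be illustrated with a minimal sketch. This is not the code in `PubmedEFetchHandler.java`: the class name, the `inAccumulatingElement` flag, and the `<i>` tag in the sample XML are assumptions chosen for illustration. Only the idea of skipping `chars.setLength(0)` while inside ArticleTitle or AbstractText comes from the PR.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, reduced to the buffer logic only.
class TitleHandlerSketch extends DefaultHandler {
    private final StringBuilder chars = new StringBuilder();
    private boolean inAccumulatingElement = false; // assumed flag name
    private String articleTitle;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals("ArticleTitle") || qName.equals("AbstractText")) {
            // Entering an accumulation context: start from an empty buffer.
            inAccumulatingElement = true;
            chars.setLength(0);
        } else if (!inAccumulatingElement) {
            // Outside a title/abstract, resetting per element is safe.
            chars.setLength(0);
        }
        // Nested inline markup inside a title/abstract: keep the buffer.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        chars.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("ArticleTitle")) {
            articleTitle = chars.toString();
            inAccumulatingElement = false;
        } else if (qName.equals("AbstractText")) {
            inAccumulatingElement = false;
        }
    }

    // Parses an XML string and returns the accumulated ArticleTitle text.
    static String parseTitle(String xml) {
        try {
            TitleHandlerSketch handler = new TitleHandlerSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.articleTitle;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Article><ArticleTitle>Role of <i>BRCA1</i> in cancer"
                + "</ArticleTitle></Article>";
        System.out.println(parseTitle(xml)); // prints "Role of BRCA1 in cancer"
    }
}
```

With an unconditional reset in `startElement`, the same input would yield "BRCA1 in cancer", because the text before the nested element is discarded when that element opens.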

Files changed

| File | Issue | Change |
| --- | --- | --- |
| `PubMedRetrievalToolController.java` | #24 | Added `phrasenotfound` check before existing `isValidAuthorString` validation |
| `PubMedArticleRetrievalService.java` | #24 | Added `isPubMedQueryDropped()` method + check in `getNumberOfPubMedArticles()` |
| `PubmedEFetchHandler.java` | #14 | Conditional `chars.setLength(0)`: skipped when inside ArticleTitle or AbstractText |
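The `phrasenotfound` check listed for #24 might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's `isPubMedQueryDropped()`: the method name `looksLikeDroppedQuery`, its signature, the two-letter threshold, and the sample QueryTranslation strings are all hypothetical. It also assumes the eSearch response has already been parsed into a list of dropped phrases and a QueryTranslation string.

```java
import java.util.List;

// Hypothetical helper; the PR does not show the real method body.
class DroppedQueryCheck {

    // Assumed cutoff: a translation with at most this many letters left
    // is treated as "just an initial". The PR does not state the real value.
    private static final int MAX_TRIVIAL_LETTERS = 2;

    static boolean looksLikeDroppedQuery(List<String> phrasesNotFound, String queryTranslation) {
        if (phrasesNotFound == null || phrasesNotFound.isEmpty()) {
            return false; // PubMed recognized every phrase
        }
        if (queryTranslation == null) {
            return true; // phrases were dropped and nothing usable remains
        }
        // Strip field tags such as [Author] or [au], then keep letters only.
        String letters = queryTranslation
                .replaceAll("\\[[^\\]]*\\]", "")
                .replaceAll("[^\\p{L}]", "");
        return letters.length() <= MAX_TRIVIAL_LETTERS;
    }

    public static void main(String[] args) {
        // Sample values are illustrative, not captured PubMed responses.
        System.out.println(looksLikeDroppedQuery(List.of("Charles-rawlins"), "J[Author]")); // true
        System.out.println(looksLikeDroppedQuery(List.of(), "Kukafka R[Author]"));          // false
    }
}
```

A caller would return a count of 0 when the check is true, rather than using the count PubMed reports for the shortened query.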

Test plan

- Existing parser tests pass (5/5 ✅ verified locally)
- Query for a person with a hyphenated/unusual surname that PubMed drops (e.g., Charles-rawlins J[au]) → should return 0 results
- Query for a normal name (e.g., Kukafka R[au]) → should still return results normally
- Fetch an article with MathML in its title (e.g., PMID 31031568) → title should contain the full text including math symbols
- Fetch an article with inline HTML formatting in its title → full title preserved

🤖 Generated with Claude Code

PubMed "helpfully" drops unrecognized name parts (e.g., "Charles-rawlins J[au]" becomes just "J[au]"), returning hundreds of irrelevant results. Check the eSearch response for errorlist.phrasenotfound entries. If PubMed dropped phrases and the remaining QueryTranslation is trivially short (just an initial), return 0 results instead of importing the noise.

Applied to both code paths: the count endpoint (controller) and the internal retrieve flow (service).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The SAX parser was clearing the character buffer on every startElement, which discarded preceding text whenever inline markup appeared inside ArticleTitle or AbstractText. This affected MathML (<mml:math>), inline HTML formatting elements, and any other nested elements.

For example, a title like "Role of BRCA1 in cancer", with BRCA1 wrapped in an inline formatting element, would be truncated to just "BRCA1 in cancer".

Now the buffer is cleared only when NOT inside an active title/abstract accumulation context, preserving the full text content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move full XML/JSON response logging from INFO to DEBUG level in PubMedRetrievalToolController. In production, large eSearch responses (up to 15MB) were being logged at INFO on every request, creating large String objects that contribute to heap pressure and OOMKill.

Verbose logging can be re-enabled at runtime via env var: LOGGING_LEVEL_RECITER_CONTROLLER=DEBUG

Also fixes three SLF4J placeholder bugs where log statements used colon syntax instead of {} placeholders, causing arguments to be silently ignored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

paulalbert1 · 2026-02-27T19:10:30Z

Added: Reduce memory pressure from verbose PubMed response logging

Context: The reciter-pubmed-prod pods are OOMKilled roughly once per day (16 restarts in 18 days, 5Gi limit). Investigation found that full PubMed eSearch responses (up to 15MB per response) were being logged at INFO level on every request, creating large String objects that contribute to heap pressure.

Changes in commit e153848:

Verbose response logging → DEBUG level (hidden by default):
- PubMedRetrievalToolController.java lines 160-162: full XML response log.info(writer.toString()) → log.debug()
- Line 168: full JSON esearchresult → log.debug()
- Line 188: query translation → log.debug()
Fixed 3 SLF4J placeholder bugs — these log statements used colon syntax ("message:",arg) instead of {} placeholders, so the arguments were silently ignored:
- log.info("PubMed Response Json:",json) → log.debug("PubMed Response Json: {}", json)
- log.info("query translation prior to processing:",queryTranslation) → log.debug(...)
- log.info("eSearchResult count for the firstNameInitial strategy :",eSearchResult.getCount()) → log.info("...: {}", ...)
Added logback-spring.xml with standard console appender configuration.
Added logging.level.reciter.controller=INFO to application.properties as the default. Operators can re-enable verbose logging at runtime via:
```
LOGGING_LEVEL_RECITER_CONTROLLER=DEBUG
```

Note: This code change alone won't fully resolve the OOM issue — the prod EKS deployment also needs JVM tuning (separate from this PR, see below).

paulalbert1 and others added 3 commits February 20, 2026 20:28

paulalbert1 wants to merge 3 commits into master from fix/esearch-dropped-query-and-mathml
