Fix #24 and #14: Dropped query detection and inline markup preservation #103
paulalbert1 wants to merge 3 commits into master
PubMed "helpfully" drops unrecognized name parts (e.g., "Charles-rawlins J[au]" becomes just "J[au]"), returning hundreds of irrelevant results.

Check the eSearch response for errorlist.phrasenotfound entries. If PubMed dropped phrases and the remaining QueryTranslation is trivially short (just an initial), return 0 results instead of importing the noise.

Applied to both code paths: the count endpoint (controller) and the internal retrieve flow (service).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
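The check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the class and method names, the two-character threshold for "trivially short", and the sample XML in `main` are all assumptions. It reads the XML form of the eSearch response (`ErrorList/PhraseNotFound`, `QueryTranslation`, `Count`) so that it needs only the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DroppedQuerySketch {

    // Assumed threshold: after field tags such as [Author] are stripped,
    // a translation this short is treated as "just an initial".
    private static final int TRIVIAL_LENGTH = 2;

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Unparseable eSearch response", e);
        }
    }

    private static String firstText(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
    }

    // True when PubMed reported a dropped phrase AND what remains of the
    // translated query is too short to be a meaningful author search.
    static boolean looksDropped(String eSearchXml) {
        Document doc = parse(eSearchXml);
        boolean phraseDropped = doc.getElementsByTagName("PhraseNotFound").getLength() > 0;
        if (!phraseDropped) {
            return false;
        }
        String translation = firstText(doc, "QueryTranslation");
        String bare = translation.replaceAll("\\[[^\\]]*\\]", "").trim();
        return bare.length() <= TRIVIAL_LENGTH;
    }

    // Reports 0 for a dropped query instead of PubMed's inflated count.
    static int effectiveCount(String eSearchXml) {
        if (looksDropped(eSearchXml)) {
            return 0;
        }
        return Integer.parseInt(firstText(parse(eSearchXml), "Count"));
    }

    public static void main(String[] args) {
        // Illustrative response, not captured from PubMed.
        String dropped = "<eSearchResult><Count>412</Count>"
                + "<QueryTranslation>J[Author]</QueryTranslation>"
                + "<ErrorList><PhraseNotFound>Charles-rawlins</PhraseNotFound></ErrorList>"
                + "</eSearchResult>";
        System.out.println(effectiveCount(dropped)); // prints 0
    }
}
```

Requiring both conditions keeps the check conservative: a response that reports a missing phrase but still carries a full surname in its translation is left alone.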
The SAX parser was clearing the character buffer on every startElement, which discarded preceding text whenever inline markup appeared inside ArticleTitle or AbstractText. This affected MathML (<mml:math>), HTML formatting (<i>, <b>, <sup>, <sub>), and any other nested elements.

For example, a title like "Role of <i>BRCA1</i> in cancer" would be truncated to just "BRCA1 in cancer".

Now the buffer is only cleared when NOT inside an active title/abstract accumulation context, preserving the full text content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
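To make the failure mode concrete, here is a small self-contained sketch of a SAX handler that can run in either mode. It is not the PR's `PubmedEFetchHandler`: the class name, the `clearOnEveryStart` flag, and the restriction to `ArticleTitle` are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TitleHandlerSketch extends DefaultHandler {
    private final StringBuilder chars = new StringBuilder();
    private final boolean clearOnEveryStart; // true reproduces the old behavior
    private boolean inTitle;
    private String title;

    TitleHandlerSketch(boolean clearOnEveryStart) {
        this.clearOnEveryStart = clearOnEveryStart;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("ArticleTitle".equals(qName)) {
            inTitle = true;
            chars.setLength(0); // start accumulating a fresh title
            return;
        }
        // The fix: nested elements such as <i> no longer reset the buffer
        // while a title is being accumulated.
        if (clearOnEveryStart || !inTitle) {
            chars.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        chars.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("ArticleTitle".equals(qName)) {
            title = chars.toString();
            inTitle = false;
        }
    }

    static String parseTitle(String xml, boolean clearOnEveryStart) {
        try {
            TitleHandlerSketch handler = new TitleHandlerSketch(clearOnEveryStart);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.title;
        } catch (Exception e) {
            throw new IllegalArgumentException("Unparseable article XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<PubmedArticle><ArticleTitle>Role of <i>BRCA1</i> in cancer"
                + "</ArticleTitle></PubmedArticle>";
        System.out.println(parseTitle(xml, true));  // prints "BRCA1 in cancer"
        System.out.println(parseTitle(xml, false)); // prints "Role of BRCA1 in cancer"
    }
}
```

The same guard would apply to an abstract-accumulation flag; it is left out here to keep the sketch short.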
Move full XML/JSON response logging from INFO to DEBUG level in
PubMedRetrievalToolController. In production, large eSearch responses
(up to 15MB) were being logged at INFO on every request, creating
large String objects that contribute to heap pressure and OOMKill.
Verbose logging can be re-enabled at runtime via env var:
LOGGING_LEVEL_RECITER_CONTROLLER=DEBUG
Also fixes three SLF4J placeholder bugs where log statements used
colon syntax instead of {} placeholders, causing arguments to be
silently ignored.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
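The placeholder bug is easy to miss because nothing fails at compile time or at runtime. The sketch below uses a deliberately simplified stand-in for `{}` substitution (the `format` helper is hypothetical and is not SLF4J's formatter) to show why an argument passed alongside a colon-style message never reaches the output. The log messages are made up; the PR text does not show the affected statements.

```java
final class PlaceholderSketch {

    // Simplified stand-in for SLF4J-style formatting: each "{}" in the
    // pattern is replaced by the next argument; surplus arguments are
    // silently ignored, as happens with a message that has no "{}".
    static String format(String pattern, Object... args) {
        StringBuilder out = new StringBuilder();
        int argIndex = 0;
        int from = 0;
        int at = pattern.indexOf("{}");
        while (at >= 0 && argIndex < args.length) {
            out.append(pattern, from, at).append(args[argIndex++]);
            from = at + 2;
            at = pattern.indexOf("{}", from);
        }
        return out.append(pattern.substring(from)).toString();
    }

    public static void main(String[] args) {
        // Colon syntax: the argument has no placeholder to land in.
        System.out.println(format("Result count: ", 42));   // prints "Result count: "
        // Placeholder syntax: the argument is substituted.
        System.out.println(format("Result count: {}", 42)); // prints "Result count: 42"
    }
}
```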
Added: Reduce memory pressure from verbose PubMed response logging
Context: The
Changes in commit e153848:
Note: This code change alone won't fully resolve the OOM issue; the prod EKS deployment also needs JVM tuning (separate from this PR, see below).
Summary
Fix #24 (Do not use results when PubMed has zero results): PubMed silently drops unrecognized name parts (e.g., `Charles-rawlins J[au]` → `J[au]`), returning hundreds of irrelevant results. Now checks the eSearch `errorlist.phrasenotfound` response field: if PubMed dropped phrases and the remaining `QueryTranslation` is trivially short (just an initial like `J`), returns 0 results. Applied to both code paths: the count endpoint (controller) and the internal retrieve flow (service).

Fix #14 (Account for LaTeX-based symbols): The SAX parser was clearing its character buffer on every `startElement()`, which discarded preceding text whenever inline markup appeared inside `<ArticleTitle>` or `<AbstractText>`. This affected MathML (`<mml:math>`), HTML formatting (`<i>`, `<b>`, `<sup>`, `<sub>`), and any other nested elements. For example, `"Role of <i>BRCA1</i> in cancer"` was truncated to `"BRCA1 in cancer"`. Now the buffer is only cleared when NOT inside an active title/abstract accumulation context.

Files changed
- `PubMedRetrievalToolController.java`: `phrasenotfound` check before existing `isValidAuthorString` validation
- `PubMedArticleRetrievalService.java`: `isPubMedQueryDropped()` method + check in `getNumberOfPubMedArticles()`
- `PubmedEFetchHandler.java`: `chars.setLength(0)` is skipped when inside ArticleTitle or AbstractText

Test plan
- Query with a dropped name part (`Charles-rawlins J[au]`) → should return 0 results
- Query with a recognized name (`Kukafka R[au]`) → should still return results normally
- Article with `<i>` formatting in its title → full title preserved

🤖 Generated with Claude Code