Negation particles and operators affect retrieval accuracy across languages #126

@janpower

Problem

Negation particles and operators change HPO term retrieval in unexpected ways across the supported languages (EN, DE, ES, FR, NL):

  • German: "Kein Erythem" (no erythema) vs "Erythem" (erythema) produce different retrieval results
  • English: "No fever" vs "fever" may show the same problem
  • Particles like "no", "kein", "nicht", "sin", "non", "geen", etc. may interfere with semantic matching
  • Stop words and other operators might need special handling before embedding

Examples

  • German: "Kein Erythem" vs "Erythem"
  • English: "No fever" vs "fever"
  • Spanish: "Sin fiebre" vs "fiebre"
  • French: "Pas de fièvre" vs "fièvre"
  • Dutch: "Geen koorts" vs "koorts"

All should retrieve similar HPO terms, with assertion detection marking negated cases appropriately.
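
One way to make "similar HPO terms" measurable is to compare the top-k result lists for each negated/plain pair. The sketch below is only an illustration: `jaccard_at_k` is a hypothetical helper, the choice of Jaccard overlap and of `k` is an assumption, and the IDs are placeholders rather than real retrieval output.

```python
from typing import Sequence


def jaccard_at_k(ids_a: Sequence[str], ids_b: Sequence[str], k: int = 10) -> float:
    """Jaccard overlap of the top-k IDs of two ranked result lists.

    1.0 means both queries returned the same set of terms in their top k,
    0.0 means the two result sets are disjoint.
    """
    top_a, top_b = set(ids_a[:k]), set(ids_b[:k])
    union = top_a | top_b
    if not union:
        # Two empty result lists are treated as identical.
        return 1.0
    return len(top_a & top_b) / len(union)


# Placeholder IDs standing in for the results of "Kein Erythem" and "Erythem".
negated_results = ["T1", "T2", "T3"]
plain_results = ["T1", "T2", "T4"]
overlap = jaccard_at_k(negated_results, plain_results, k=3)
```

A score well below 1.0 for pairs such as "Kein Erythem" / "Erythem" would confirm that the particle shifts retrieval.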

Possible Solutions

  1. Remove negation particles before embedding/retrieval?
  2. Handle stop words and operators separately in preprocessing?
  3. Normalize text to strip particles while preserving assertion metadata?
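
Option 3 could look roughly like the sketch below. It is a minimal illustration, not the project's implementation: `NEGATION_PARTICLES`, `NormalizedQuery`, and `strip_negation` are hypothetical names, and the particle lists are short, incomplete examples. The idea is to strip particles from the text that gets embedded while keeping a `negated` flag for the assertion step.

```python
import re
from dataclasses import dataclass

# Illustrative, incomplete particle lists (an assumption, not the real lexicon).
NEGATION_PARTICLES = {
    "en": ["no", "not", "without"],
    "de": ["kein", "keine", "keinen", "nicht", "ohne"],
    "es": ["sin", "no"],
    "fr": ["pas de", "sans", "non"],
    "nl": ["geen", "niet", "zonder"],
}


@dataclass
class NormalizedQuery:
    text: str      # text to embed, with negation particles removed
    negated: bool  # True if at least one particle was stripped


def strip_negation(text: str, lang: str) -> NormalizedQuery:
    """Remove whole-word negation particles and record that they were present."""
    particles = sorted(NEGATION_PARTICLES.get(lang, []), key=len, reverse=True)
    if not particles:
        return NormalizedQuery(" ".join(text.split()), False)
    # Longest particles first so "pas de" wins over shorter alternatives;
    # the lookarounds keep "no" from matching inside words like "normal".
    alternation = "|".join(re.escape(p) for p in particles)
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)\s*", re.IGNORECASE)
    stripped, count = pattern.subn("", text)
    return NormalizedQuery(" ".join(stripped.split()), count > 0)


query = strip_negation("Kein Erythem", "de")
# query.text == "Erythem", query.negated is True
```

Plain string matching like this cannot tell a negation particle from the same word used in another sense, so in practice the `negated` flag would more likely come from the existing assertion detection than from this helper.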

Context

Related to the ConText assertion detection system (issue #79): negations are detected, but the negation particles may still degrade retrieval quality during the vector search phase.

Affected Languages

EN, DE, ES, FR, NL

Labels

enhancement, text-processing, multilingual
