Description
Problem
Negation particles and operators change HPO term retrieval results in unintended ways across all supported languages (EN, DE, ES, FR, NL):
- German: "Kein Erythem" (no erythema) vs "Erythem" (erythema) produce different retrieval results
- English: "No fever" vs "fever" may have similar issues
- Particles like "no", "kein", "nicht", "sin", "non", "geen", etc. may interfere with semantic matching
- Stop words and other operators might need special handling before embedding
Examples
- German: "Kein Erythem" vs "Erythem"
- English: "No fever" vs "fever"
- Spanish: "Sin fiebre" vs "fiebre"
- French: "Pas de fièvre" vs "fièvre"
- Dutch: "Geen koorts" vs "koorts"
In each pair, the negated phrase should retrieve the same (or nearly the same) HPO terms as the plain phrase, with assertion detection marking the negated cases appropriately.
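One way to make this expectation measurable is to compare the retrieved term sets for each pair. The sketch below is only an illustration: `retrieve` is a hypothetical callable standing in for whatever function returns HPO term IDs for a query, and `jaccard` and `retrieval_overlap` are names invented here, not part of the project.

```python
from typing import Callable, Dict, Sequence, Tuple

# (negated phrase, plain phrase) pairs taken from the examples above.
PAIRS = [
    ("Kein Erythem", "Erythem"),
    ("No fever", "fever"),
    ("Sin fiebre", "fiebre"),
    ("Pas de fièvre", "fièvre"),
    ("Geen koorts", "koorts"),
]


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard overlap of two result lists, ignoring rank order."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty result sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def retrieval_overlap(
    retrieve: Callable[[str], Sequence[str]],
    pairs: Sequence[Tuple[str, str]] = PAIRS,
) -> Dict[str, float]:
    """Map each negated phrase to its overlap with the plain phrase.

    `retrieve` is an assumed interface: query text in, term IDs out.
    """
    return {
        negated: jaccard(retrieve(negated), retrieve(plain))
        for negated, plain in pairs
    }
```

An overlap close to 1.0 for every pair would indicate that the particles no longer influence retrieval; low values would point to the affected languages.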
Possible Solutions
- Remove negation particles before embedding/retrieval?
- Handle stop words and operators separately in preprocessing?
- Normalize text to strip particles while preserving assertion metadata?
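A minimal sketch of the third option, assuming a simple per-language particle list. The `NEGATION_PARTICLES` table, the `NormalizedQuery` class, and the `strip_negation` function are hypothetical names for illustration; the lists are incomplete, and in practice the cues would more likely come from the existing ConText rules.

```python
import re
from dataclasses import dataclass

# Hypothetical, incomplete particle lists for illustration only.
NEGATION_PARTICLES = {
    "en": ["no", "not", "without"],
    "de": ["kein", "keine", "nicht", "ohne"],
    "es": ["sin", "no"],
    "fr": ["pas de", "sans", "non"],
    "nl": ["geen", "niet", "zonder"],
}


@dataclass
class NormalizedQuery:
    text: str      # text to embed, with particles removed
    negated: bool  # assertion flag preserved as metadata
    original: str  # untouched input, kept for traceability


def strip_negation(text: str, lang: str) -> NormalizedQuery:
    """Remove known negation particles and record whether any was found."""
    particles = NEGATION_PARTICLES.get(lang, [])
    if not particles:
        return NormalizedQuery(text.strip(), False, text)
    # Longest particles first so multi-word cues like "pas de" win.
    ordered = sorted(particles, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    stripped, count = pattern.subn(" ", text)
    cleaned = " ".join(stripped.split())  # collapse leftover whitespace
    return NormalizedQuery(cleaned, count > 0, text)


query = strip_negation("Kein Erythem", "de")
# query.text is embedded for retrieval; query.negated travels with the result.
```

Only `query.text` would be sent to the vector search, while `query.negated` is attached to the retrieved terms afterwards. This blind stripping ignores negation scope, so it is a starting point for discussion rather than a replacement for assertion detection.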
Context
Related to the ConText assertion detection system (issue #79): negations are detected, but the negation particles may still degrade retrieval quality during the vector search phase.
Affected Languages
EN, DE, ES, FR, NL
Labels
enhancement, text-processing, multilingual