## Description
When a newline character \n directly follows a word with no space in between, the \n and the space after it ("\n ") are tokenized as a single token. That token is tagged SPACE and is skipped, because ignore_space_tokens=True in some pipelines. As a result, after normalization the words before and after \n are concatenated, and pipelines can no longer detect the word after it. In the code below, the eds.diabetes() pipeline does not detect the word "Diabète", and the get_text() call that follows shows why.
## How to reproduce the bug
import edsnlp, edsnlp.pipes as eds
txt = "problématique\n Diabète de type 1 depuis 5 ans chez une enfant de 7 ans"
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.diabetes())
doc = nlp(txt)
for ent in doc.ents:
    print(ent.text, ent.label_)
--> Nothing in the terminal
from edsnlp.utils.doc_to_text import get_text
get_text(doc, attr="NORM", ignore_excluded=True, ignore_space_tokens=True)
--> 'problematiquediabete de type 1 depuis 5 ans chez une enfant de 7 ans'
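The concatenation can be illustrated without EDS-NLP using a toy model. This is only a sketch of the mechanism described above, not the library's implementation: the `Tok` class and the `join_tokens` function are hypothetical names, and the token split (a single `"\n "` token with no trailing whitespace on the preceding word) is assumed from the behaviour reported in this issue.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a token (not an EDS-NLP or spaCy class)."""
    text: str        # normalized token text
    whitespace: str  # trailing whitespace attached to the token
    is_space: bool   # True when the token is tagged SPACE


def join_tokens(tokens, ignore_space_tokens):
    """Rebuild text from tokens, optionally dropping SPACE tokens."""
    kept = [t for t in tokens if not (ignore_space_tokens and t.is_space)]
    return "".join(t.text + t.whitespace for t in kept)


# Assumed tokenization of "problématique\n Diabète de" after normalization:
# the word before "\n" carries no trailing whitespace, and "\n " is one token.
tokens = [
    Tok("problematique", "", False),
    Tok("\n ", "", True),
    Tok("diabete", " ", False),
    Tok("de", "", False),
]

with_spaces = join_tokens(tokens, ignore_space_tokens=False)
without_spaces = join_tokens(tokens, ignore_space_tokens=True)
print(repr(with_spaces))
print(repr(without_spaces))
```

In this model, dropping the `"\n "` token removes the only separator between the two words, because the preceding word has no trailing whitespace of its own. That matches the `'problematiquediabete ...'` string returned by `get_text()` above.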
## Your Environment
- Operating System: Windows11
- Python Version Used: 3.10.16
- EDS-NLP Version Used: 0.17.1