EntityRuler: is it possible to ignore spaces, break line and other non visible characters #6508
Another question about the EntityRuler: what if I have a sentence that contains several line breaks, e.g. "this is a \n\n sentence", and a pattern that should match across them? Here is the code example (copied from previous discussions):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_md')
matcher = Matcher(nlp.vocab)
matcher.add("1", [[{"LOWER":"a"},{"LOWER":"sentence"}]])
for text in ["this is a sentence", "this is a sentence", "this is a\n sentence", "this is a\n\nsentence"]:
    for match_id, start, end in matcher(nlp(text)):
        print(nlp.vocab.strings[match_id], "matched:", text)
Replies: 3 comments
I'm sure this isn't the answer you're looking for - sorry - but wouldn't it make more sense to deal with such newlines, tabs, etc. in preprocessing, perhaps even before you process the text with spaCy?
Dear @svlandeg, indeed that could be a solution, but if I already have entities and/or other structures tied to the text, I would have to either work with the text as it is, or remove the invalid characters and update the offsets of all those structures. In my first attempt I tried replacing all these invisible characters with spaces, but it did not work, and I was wondering whether there is a better way to deal with these characters without removing them.
The only other option is to include optional space tokens at all the possible positions in the pattern, with something like {"OP": "*", "IS_SPACE": True}.
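For illustration, here is a minimal sketch of that suggestion applied to the Matcher example from the question. It assumes spaCy v3; the helper `with_optional_spaces` is a hypothetical name, not part of spaCy, and a blank English pipeline is used instead of `en_core_web_md` only because the tokenizer is all this example needs. The second text is written with two spaces, on the assumption that extra spaces are one of the cases of interest.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")

# Zero or more whitespace tokens (extra spaces, newlines, tabs).
OPTIONAL_SPACE = {"IS_SPACE": True, "OP": "*"}


def with_optional_spaces(pattern):
    """Hypothetical helper: insert an optional whitespace token between
    every pair of adjacent token specs in a Matcher pattern."""
    spaced = []
    for i, token_spec in enumerate(pattern):
        if i > 0:
            spaced.append(dict(OPTIONAL_SPACE))
        spaced.append(token_spec)
    return spaced


matcher = Matcher(nlp.vocab)
matcher.add("1", [with_optional_spaces([{"LOWER": "a"}, {"LOWER": "sentence"}])])

texts = [
    "this is a sentence",
    "this is a  sentence",
    "this is a\n sentence",
    "this is a\n\nsentence",
]

matched = []
for text in texts:
    doc = nlp(text)
    # Newlines and extra spaces show up as their own whitespace tokens.
    print([token.text for token in doc])
    if matcher(doc):
        matched.append(text)
        print("matched:", repr(text))
```

Because the original text is left untouched, character offsets of existing annotations stay valid. The same token pattern list should also be usable as the "pattern" value of an EntityRuler entry, though the matched span would then include the whitespace tokens.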