EntityRuler: is it possible to ignore spaces, break line and other non visible characters #6508

lfoppiano · 2020-12-05T06:54:54Z

Another question about the EntityRuler: what if I have a sentence that contains several line breaks, e.g.:

"this is a \n\n sentence"

and I have a pattern such as {"LOWER":"a"},{"LOWER":"sentence"}? It won't match.
I wonder whether it would be possible to tell the EntityRuler to just ignore characters such as double spaces, tabs, line breaks, etc.
Or is there another way to cope with this kind of garbage in the text?

Here is a code example (copied from previous discussions):

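import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_md')

# Pattern: the token "a" immediately followed by the token "sentence"
matcher = Matcher(nlp.vocab)
matcher.add("1", [[{"LOWER":"a"},{"LOWER":"sentence"}]])

# Only the first text matches: in the others, the extra whitespace becomes
# a token of its own that sits between "a" and "sentence".
for text in ["this is a sentence", "this is a  sentence", "this is a\n sentence", "this is a\n\nsentence"]:
    for match_id, start, end in matcher(nlp(text)):
        print(nlp.vocab.strings[match_id], "matched:", text)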

Answered by adrianeboyd

Dec 7, 2020

The only other option is to include optional space tokens at all the possible positions in the pattern with something like {"OP": "*", "IS_SPACE": True}.

svlandeg · 2020-12-05T16:47:30Z

I'm sure this isn't the answer you're looking for, sorry, but wouldn't it make more sense to deal with such newlines, tabs, etc. in preprocessing, perhaps even before you process the text with spaCy?

lfoppiano · 2020-12-06T11:36:03Z

Dear @svlandeg, indeed that could be a solution, but if I already have entities or other structures tied to the text by character offsets, I would have to either work with the text as it is, or remove the invalid characters and update the offsets of all those structures.

In my first attempt I tried replacing all these invisible characters with spaces, but it did not work, and I was wondering whether there is a (better) way to deal with these characters without removing them.
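
For the preprocessing route, the offset bookkeeping can be kept fairly small. The following is only a sketch in plain Python, not something taken from spaCy: `collapse_whitespace` and `to_original_span` are hypothetical helper names, and it assumes that spans found in the cleaned text start and end on non-whitespace characters.

```python
import re


def collapse_whitespace(text):
    """Collapse every whitespace run in `text` to a single space.

    Returns the cleaned text and a list `offsets` where offsets[i] is the
    index in `text` of the character (or run) that produced cleaned[i].
    """
    cleaned = []
    offsets = []
    # Each match is either a whole whitespace run (group 1) or one other character.
    for match in re.finditer(r"(\s+)|\S", text):
        cleaned.append(" " if match.group(1) is not None else match.group())
        offsets.append(match.start())
    return "".join(cleaned), offsets


def to_original_span(offsets, start, end):
    """Map a [start, end) span in the cleaned text back to the original text.

    Assumes the last character of the span is not a collapsed whitespace run.
    """
    return offsets[start], offsets[end - 1] + 1


text = "this is a \n\n sentence"
cleaned, offsets = collapse_whitespace(text)
start = cleaned.index("a sentence")
orig_start, orig_end = to_original_span(offsets, start, start + len("a sentence"))
print(repr(cleaned), repr(text[orig_start:orig_end]))
```

With this approach the cleaned text is what gets processed, and any span found in it can be translated back to the original offsets, so the existing structures would not need to be rewritten.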

adrianeboyd · 2020-12-07T07:08:12Z

The only other option is to include optional space tokens at all the possible positions in the pattern with something like {"OP": "*", "IS_SPACE": True}.
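
Applied to the pattern from the question, that suggestion could look roughly like the sketch below. `with_optional_spaces` and `texts_that_match` are hypothetical helper names introduced here for illustration, and `spacy.blank("en")` is used instead of `en_core_web_md` on the assumption that only the tokenizer is needed for `LOWER` and `IS_SPACE`.

```python
# Optional whitespace token, as suggested above.
SPACE = {"IS_SPACE": True, "OP": "*"}


def with_optional_spaces(pattern):
    """Hypothetical helper: copy `pattern`, inserting an optional
    whitespace token between each pair of adjacent token specs."""
    spaced = []
    for i, token_spec in enumerate(pattern):
        if i > 0:
            spaced.append(dict(SPACE))
        spaced.append(dict(token_spec))
    return spaced


def texts_that_match(texts, pattern):
    """Return the texts in which the whitespace-tolerant pattern matches."""
    # Imported here so the helper above can be used without spaCy installed.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")  # tokenizer only; no trained model is loaded
    matcher = Matcher(nlp.vocab)
    matcher.add("1", [with_optional_spaces(pattern)])
    return [text for text in texts if matcher(nlp(text))]
```

Calling `texts_that_match` with the four texts from the question and the pattern `[{"LOWER": "a"}, {"LOWER": "sentence"}]` should return all four. Note that the matched span then includes the whitespace tokens, and that the same token pattern can be passed to the EntityRuler, since it accepts Matcher-style token patterns.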
