Regex in Entity Ruler for Japanese #12729
-
I tried to use some regex expressions with the Entity Ruler to create a new entity. However, an expression like … works only with a double space or with a space entered from a Japanese keyboard, which is a different, wider character than the English one. I've tried it with the Japanese pipelines ja_core_news_sm and ja_ginza_electra, but the results are the same. I understand that the Entity Ruler processes only one token at a time, but I assumed that other than that regex would work the same way. Are there other differences in how spaCy processes regex? Or am I just missing something?
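For reference, the two space characters mentioned in the question are distinct code points. The following is a minimal sketch using only Python's standard library, so it says nothing about spaCy itself; the constant names `HALF_WIDTH` and `FULL_WIDTH` are made up for illustration.

```python
import re
import unicodedata

HALF_WIDTH = " "        # U+0020, the ASCII space
FULL_WIDTH = "\u3000"   # U+3000, the wide space typically typed with a Japanese IME

# Print the code point and official Unicode name of each character.
for ch in (HALF_WIDTH, FULL_WIDTH):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# A literal ASCII space in a pattern matches only U+0020.
literal_matches_full = re.fullmatch(" ", FULL_WIDTH) is not None

# \s in a Python 3 str pattern matches any Unicode whitespace, so it matches both.
class_matches_half = re.fullmatch(r"\s", HALF_WIDTH) is not None
class_matches_full = re.fullmatch(r"\s", FULL_WIDTH) is not None

print(literal_matches_full, class_matches_half, class_matches_full)
```

So a pattern written with a literal space and one written with `\s` can behave differently on Japanese input even before any tokenization is involved.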
Replies: 2 comments
-
Hey TeranotonriOnigasumu, would you be able to provide a code example where the regex works as expected but the patterns fail?
-
In spaCy, an ASCII space at the end of a token is not represented as a separate token; instead, it becomes an attribute of the token itself. For example:

```python
import spacy

nlp = spacy.blank("en")
# "on" is followed by an ASCII space and then a full-width space (\u3000)
doc = nlp("Sandra met Jerry on \u3000Monday.")
print([token.text for token in doc])
```

Here we have 7 tokens:

['Sandra', 'met', 'Jerry', 'on', '\u3000', 'Monday', '.']

We can check for the trailing half-width whitespace with [token.whitespace_ for token in doc], which in this case gives us:

[' ', ' ', ' ', ' ', '', '', '']

The last three tokens have no trailing half-width space: the whitespace token "\u3000" is followed directly by "Monday", "Monday" is adjacent to ".", and "." is at the end of the string. All other tokens have a trailing…

In summary, spaCy handles the two spaces differently. If you would like to match tokens that have a trailing space you can use the …
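As a rough illustration of the difference described in this reply, the sketch below uses the Matcher token attribute `SPACY` (true when a token has a trailing ASCII space) inside an entity ruler pattern. The label `ON_PLUS_SPACE` and the two test sentences are made up for illustration, a blank English pipeline is assumed rather than the Japanese ones from the question, and this is not necessarily the approach the truncated reply went on to recommend.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical pattern: the token "on", but only when an ASCII space follows it.
ruler.add_patterns([
    {"label": "ON_PLUS_SPACE", "pattern": [{"LOWER": "on", "SPACY": True}]},
])

# ASCII space after "on": the space is stored on the token, so the pattern matches.
half = nlp("Sandra met Jerry on Monday.")
# Only a full-width space after "on": it is not stored as trailing whitespace.
full = nlp("Sandra met Jerry on\u3000Monday.")

print([(ent.text, ent.label_) for ent in half.ents])  # → [('on', 'ON_PLUS_SPACE')]
print([(ent.text, ent.label_) for ent in full.ents])  # → []
print(repr(full[3].whitespace_))                      # → ''
```

Because the regex in a pattern is applied to a single token's text, a space written inside the regex itself has nothing to match against when that space was absorbed into `whitespace_`; checking the attribute is one way around that.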