Regex in Entity Ruler for Japanese #12729

SmallThings-coder · 2023-06-15T09:13:21Z

I tried to use regular expressions with the Entity Ruler to create a new entity. A plain regex such as
f"^(第[{numbers}]{{1,2}})\\s"
works fine with an ordinary single space inside Japanese text, but the pattern
patterns = [
    {
        "label": "d",
        "pattern": [
            {"TEXT": {"REGEX": f"^第"}},
            {"TEXT": {"REGEX": f"[{numbers}]{{1}}"}},
            {"TEXT": {"REGEX": f"\\s"}},
        ],
    }
]

matches only if the text contains a double space, or a space entered from a Japanese keyboard, which is a different, wider character than the English one.

I've tried it with the Japanese pipelines ja_core_news_sm and ja_ginza_electra, but the results are the same. I understand that the Entity Ruler applies each regex to one token at a time, but I assumed that otherwise regex would work the same way. Are there other differences in how spaCy processes regex? Or am I just missing something?

Answered by kadarakos

kadarakos · 2023-06-16T13:54:08Z

Hey TeranotonriOnigasumu,

Would you be able to provide a code example where the regex works as expected but the patterns fail?

kadarakos · 2023-06-20T08:45:43Z

In spaCy, an ASCII space at the end of a token is not represented as a separate token; instead, it is stored as an attribute of the token itself.

For example:

import spacy

nlp = spacy.blank("en")
doc = nlp("Sandra met Jerry on  Monday.")
print([token.text for token in doc])

Here we have 7 tokens:

['Sandra', 'met', 'Jerry', 'on', ' ', 'Monday', '.']

We can check for the trailing half-width whitespace with

[token.whitespace_ for token in doc]

which in this case gives us:

[' ', ' ', ' ', ' ', '', '', '']

The last three tokens have no trailing whitespace: the whitespace token " " itself, "Monday", which is directly followed by ".", and ".", which is at the end of the string. All other tokens have a trailing whitespace.
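To make the difference easier to see, here is a small sketch that prints each token together with its trailing whitespace for a single and a double space. It assumes a blank English pipeline (so nothing has to be downloaded); a Japanese pipeline uses a different tokenizer and may segment text differently. The helper name `tokens_with_whitespace` is made up for this example.

```python
import spacy

# Assumption: a blank English pipeline stands in for the Japanese ones,
# which use a different tokenizer and may behave differently.
nlp = spacy.blank("en")


def tokens_with_whitespace(text):
    """Return (token text, trailing whitespace) pairs for a string."""
    return [(token.text, token.whitespace_) for token in nlp(text)]


# Single ASCII space: it is attached to "on", no whitespace token exists.
single = tokens_with_whitespace("on Monday")
print(single)  # [('on', ' '), ('Monday', '')]

# Double ASCII space: the first is attached to "on", the second is a token.
double = tokens_with_whitespace("on  Monday")
print(double)  # [('on', ' '), (' ', ''), ('Monday', '')]

# An ideographic (full-width) space, U+3000, is not an ASCII space,
# so it is worth inspecting how it comes out as well.
print(tokens_with_whitespace("on\u3000Monday"))
```

Running the same helper on your own Japanese text with your pipeline should show whether a whitespace token is actually present at the position where the pattern expects one.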

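In summary, spaCy handles the two spaces differently: the first is stored as the trailing whitespace of "on", while the second becomes a token of its own. A token pattern such as {"TEXT": {"REGEX": "\\s"}} can therefore only match when a separate whitespace token exists, which is not the case for a single ASCII space. If you would like to match tokens that have a trailing space, you can use the SPACY attribute in the matcher patterns: https://spacy.io/usage/rule-based-matching#adding-patterns-attributes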
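As a rough illustration of the SPACY attribute, the sketch below adds an Entity Ruler pattern that requires the number token to be followed by a space. It again assumes a blank English pipeline, and the word "chapter", the sample text and the simplified regex are stand-ins invented for this example, not the original Japanese pattern.

```python
import spacy

# Assumption: blank English pipeline and an English stand-in for the
# original pattern; the text and the regex are illustrative only.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    {
        "label": "d",
        "pattern": [
            {"LOWER": "chapter"},
            # One or two digits, and the token must have a trailing space.
            {"TEXT": {"REGEX": r"^\d{1,2}$"}, "SPACY": True},
        ],
    }
]
ruler.add_patterns(patterns)

# The first "12" is followed by a space; the last "12" ends the string.
doc = nlp("Chapter 12 begins after chapter 12")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Chapter 12', 'd')]
```

Only the first occurrence is matched, because the final "12" has no trailing space. Note that the space itself is not part of the entity text, since it is not a token.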
