LOWER not working for span matcher #11815

iamyihwa · 2022-11-16T13:04:50Z

Applying 'LOWER' in a span matcher pattern does not seem to work well.

brands = ['nike corp', 'adidas corp']
(1) patterns = [{"label": "BRAND", "pattern": brand} for brand in brands]

(2) patterns = [{"label": "BRAND", "pattern": [{"LOWER": brand }]} for brand in brands]

Neither scenario (1) nor scenario (2) reliably detects the brand names.

How to reproduce the behaviour

import spacy

brands = ['nike corp', 'adidas corp']
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("span_ruler")
# patterns = [{"label": "BRAND", "pattern": brand} for brand in brands]
patterns = [{"label": "BRAND", "pattern": [{"LOWER": brand}]} for brand in brands]
with nlp.select_pipes(enable="tagger"):
    ruler.add_patterns(patterns)
print(patterns)

text = "nike corp is a brand."
doc1 = nlp(text)
print([(span.text, span.label_, span.start) for span in doc1.spans["ruler"]])

text = "NIKE CORP is a brand."
doc1 = nlp(text)
print([(span.text, span.label_, span.start) for span in doc1.spans["ruler"]])

Your Environment

Operating System: Windows
Python Version Used: 3.7.2
spaCy Version Used: 3.4.1
Environment Information:

Answered by rmitsch

rmitsch · 2022-11-16T14:44:04Z

Maintainer

Hi @iamyihwa, the components of the patterns passed to span_ruler have to correspond to spaCy's tokenization, i.e. each part of your pattern has to align with a token as recognized by spaCy.

If you want to match spans consisting of multiple tokens (such as "nike corp" or "adidas corp"), the pattern has to reflect this (see here for another example with a pattern for "san francisco"). So instead of

patterns = [{"label": "BRAND", "pattern": [{"lower": "nike corp"}]}]

you'll want to do:

patterns = [{"label": "BRAND", "pattern": [{"lower": "nike"}, {"lower": "corp"}]}]

In your example you can replace

patterns = [{"label": "BRAND", "pattern": [{"LOWER": brand}]} for brand in brands]

with

patterns = [{"label": "BRAND", "pattern": [{"lower": token} for token in brand.split()]} for brand in brands]

to obtain correct results.
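
To illustrate why each part of a pattern has to line up with one token, here is a minimal pure-Python sketch. It does not use spaCy's matcher: build_token_patterns and match_lower are hypothetical helpers, whitespace splitting only stands in for spaCy's tokenizer, and lowercasing each word in the builder is an added assumption so that mixed-case brand lists still work.

```python
def build_token_patterns(brands, label="BRAND"):
    """Build one {"lower": ...} dict per whitespace-separated word (hypothetical helper)."""
    return [
        {"label": label, "pattern": [{"lower": word.lower()} for word in brand.split()]}
        for brand in brands
    ]


def match_lower(tokens, token_specs):
    """Simplified stand-in for token matching: spec k must equal the lowercase of token i + k."""
    n = len(token_specs)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(tokens[i + k].lower() == spec["lower"] for k, spec in enumerate(token_specs))
    ]


# Whitespace split as a rough stand-in for spaCy's tokens (assumption).
tokens = "NIKE CORP is a brand .".split()

single = [{"lower": "nike corp"}]                       # one spec, but the text has two tokens
per_token = build_token_patterns(["nike corp"])[0]["pattern"]

print(match_lower(tokens, single))     # []
print(match_lower(tokens, per_token))  # [(0, 2)]
```

No single token ever has the lowercase form "nike corp", so the one-dict pattern cannot match, while the per-token pattern matches regardless of case. Brand names containing punctuation may be split differently by spaCy than by str.split, so the real tokenization is worth checking for such cases.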

4 replies

iamyihwa Nov 17, 2022
Author

Thanks @rmitsch for your help!

patterns = [{"label": "BRAND", "pattern": [{"lower": token} for token in brand.split()]} for brand in brands]

works perfectly!

iamyihwa Nov 17, 2022
Author

I was just wondering how the different ways of doing rule-based matching (e.g. 'EntityRuler', 'SpanMatcher', 'PhraseMatcher', 'TokenMatcher', etc.) differ in their uses; they seem very different.
I intended to do 'EntityRuler' using 'SpanMatcher'.
In fact, the solution is to use 'EntityRuler' with 'TokenMatcher'.

rmitsch Nov 17, 2022
Maintainer

An alternative way to achieve the same goal would be to configure your EntityRuler or SpanRuler like so (see the docs):

ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "lower"})

Then you can add your patterns without breaking them down into tokens:

patterns = [{"label": "BRAND", "pattern": brand} for brand in brands]

There are some caveats regarding the tokenization though that might or might not be relevant for you if you choose to use this approach.
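
As a rough illustration of the tokenization caveat, the sketch below mimics matching phrases on lowercased token sequences in plain Python. It is not spaCy's PhraseMatcher: phrase_matches is a hypothetical helper and str.split is an assumed stand-in for the tokenizer. The point is only that the phrase and the text are both tokenized first, and a match requires the two token sequences to agree.

```python
def phrase_matches(text, phrases, tokenize=str.split):
    """Find phrases in text by comparing lowercased token sequences (simplified sketch)."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    hits = []
    for phrase in phrases:
        key = [t.lower() for t in tokenize(phrase)]
        if not key:
            continue  # skip empty phrases
        for i in range(len(lowered) - len(key) + 1):
            if lowered[i:i + len(key)] == key:
                # Record the original-case text and the start token index.
                hits.append((" ".join(tokens[i:i + len(key)]), i))
    return hits


brands = ["nike corp", "adidas corp"]

found = phrase_matches("Adidas Corp and NIKE CORP are brands", brands)
missed = phrase_matches("adidas-corp is a brand", brands)

print(found)   # [('NIKE CORP', 3), ('Adidas Corp', 0)]
print(missed)  # []
```

In this sketch "adidas-corp" stays a single token and therefore does not line up with the two-token phrase. spaCy's tokenizer follows its own rules, so which spellings match in practice depends on how it splits both the pattern and the text.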

iamyihwa Nov 17, 2022
Author

Thank you for your explanation, @rmitsch!
I will certainly use the first approach, but I will keep the second one in mind too!

Thanks for the great work and help!!!
