SpaCy matcher not matching case-insensitive words in a document #11404

mahagilo · 2022-08-30T07:10:19Z

I want the spaCy Matcher to match keywords (multi-word entities) in a document regardless of their case. With the code below I can only match "product preferences", not "PRODUCT PREFERENCES", "Product Preferences", or any other case variant in my document, because token.lemma_ is case-sensitive.

pat_piece = ({"LEMMA": token.lemma_.lower()} if is_final_token(token, tmpdoc)
        else {"LOWER": token.lower_})

I tried forcing it with a {"LEMMA" : { "IN" : [] } } construction, adding .upper() and other cases (idea came from https://stackoverflow.com/questions/64758759/force-spacy-lemmas-to-be-lowercase).

No luck still. Can someone suggest how I can match ALL cases for my keywords (multi-word entities)?

kinghuang · 2022-08-31T01:30:39Z

Perhaps set a Token extension that returns the lemma in lowercase form, then match on that?

Token.set_extension("lemma_lower", getter=lambda t: t.lemma_.lower())

pat_piece = {"_": {"lemma_lower": token.lemma_.lower()}} if is_final_token(token, tmpdoc) else {"LOWER": token.lower_}
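A minimal, self-contained sketch of this suggestion, assuming spaCy v3. The blank English pipeline, the hand-set lemmas, and the pattern label "PRODUCT_PREFERENCE" are illustrative assumptions, not part of the original code; a real pipeline would get its lemmas from a lemmatizer.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Token

# Register the getter on the Token class (capital T), once per process.
Token.set_extension("lemma_lower", getter=lambda t: t.lemma_.lower(), force=True)

# Blank pipeline: it has no lemmatizer, so lemmas are set by hand below.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
matcher.add(
    "PRODUCT_PREFERENCE",  # hypothetical label
    [[{"LOWER": "product"}, {"_": {"lemma_lower": "preference"}}]],
)

# Hypothetical doc whose final token kept an uppercase lemma.
doc = Doc(
    nlp.vocab,
    words=["PRODUCT", "PREFERENCES"],
    lemmas=["PRODUCT", "PREFERENCE"],
)

matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

The "_" key tells the matcher to compare against the custom attribute, so the uppercase lemma is lowercased by the getter before comparison.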

1 reply

mahagilo Aug 31, 2022
Author

Thank you for your insight, kinghuang. How can I save my Spacy matcher patterns to include the token.set_extension?

I use the following code when building the spaCy matcher (iterating through 20,000 keywords):

token.set_extension("lemma_lower", force=True, getter=lambda t: t.lemma_.lower())
pat_piece = {"_": {"lemma_lower": token.lemma_.lower()}} if is_final_token(token, tmpdoc) else {"LOWER": token.lower_}

I save nlp and matcher as pickle files:

with open(save_matcher, "wb") as f:
    pickle.dump(matcher, f)
with open(save_nlp, "wb") as f:
    pickle.dump(nlp, f)

When I load nlp and matcher in another script, I get the error: Can't retrieve unregistered extension attribute 'lemma_lower'. Did you forget to call the set_extension method?

When I add the following line to that script, I get the error "name 'token' is not defined":

token.set_extension("lemma_lower", force=True, getter=lambda t: t.lemma_.lower())

adrianeboyd · 2022-08-31T07:47:00Z

For context, the related question and my answer from SO: https://stackoverflow.com/questions/73524777/why-is-spacy-matcher-not-matching-case-insensitive-words-in-a-document

With the provided attributes you can only match LOWER or LEMMA, not "lowercase lemma". So if you generate this pattern:
{"LEMMA": "product"}
for a token whose lemma is PRODUCT, it simply won't match.
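To make the difference concrete, here is a small sketch, assuming spaCy v3. The one-token doc and its hand-set uppercase lemma are made up for illustration; the labels "BY_LEMMA" and "BY_LOWER" are hypothetical.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# One matcher per attribute so the results are easy to tell apart.
lemma_matcher = Matcher(nlp.vocab)
lemma_matcher.add("BY_LEMMA", [[{"LEMMA": "product"}]])

lower_matcher = Matcher(nlp.vocab)
lower_matcher.add("BY_LOWER", [[{"LOWER": "product"}]])

# Hypothetical token whose lemma stayed uppercase.
doc = Doc(nlp.vocab, words=["PRODUCT"], lemmas=["PRODUCT"])

lemma_hits = lemma_matcher(doc)  # LEMMA compares the lemma string exactly
lower_hits = lower_matcher(doc)  # LOWER compares the lowercased token text
print(len(lemma_hits), len(lower_hits))
```

Only the LOWER pattern finds the token, because the LEMMA comparison is an exact string match against "PRODUCT".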

If you want to match lowercase lemmas, some options:

- postprocess the docs to lowercase lemmas before running the matcher (either separately in your script or with a custom pipeline component)

- use a custom lemmatizer that produces lowercase lemmas

- use a custom extension with a getter to return the lowercase form of the lemma for use with a "_" matcher pattern (a "property extension" as described here: https://spacy.io/usage/processing-pipelines#description)

If your only concern is matching lowercase lemmas, I'd suggest the first option as the easiest to implement and fastest to run in the matcher.
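The first option could look roughly like the sketch below, assuming spaCy v3. The component name "lowercase_lemmas", the blank pipeline, and the hand-built doc are assumptions for illustration; in a trained pipeline the component would be added after the lemmatizer.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Doc


@Language.component("lowercase_lemmas")  # hypothetical component name
def lowercase_lemmas(doc):
    """Overwrite every lemma with its lowercase form."""
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc


nlp = spacy.blank("en")
# With a trained pipeline: nlp.add_pipe("lowercase_lemmas", after="lemmatizer")
nlp.add_pipe("lowercase_lemmas")

matcher = Matcher(nlp.vocab)
matcher.add("PRODUCT_PREFERENCE", [[{"LOWER": "product"}, {"LEMMA": "preference"}]])

# Hand-built doc standing in for lemmatizer output with mixed-case lemmas.
doc = Doc(
    nlp.vocab,
    words=["Product", "PREFERENCE"],
    lemmas=["Product", "PREFERENCE"],
)
# Run the pipeline components on the prepared doc.
for _name, component in nlp.pipeline:
    doc = component(doc)

matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

Because the lemmas are normalized once per doc, the patterns can keep using the built-in LEMMA attribute and no custom extension is needed at match time.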

At this point it's not clear exactly what you're trying to match and what's matching or not matching. Please include a minimal example that shows a full doc and the full matcher patterns that you're testing, and explain what the intended matches are.

3 replies

mahagilo Aug 31, 2022
Author

Okay, let me clarify: Building a matcher pattern for 20,000 keywords takes a while. (Note: I save two objects, the spaCy matcher and nlp, in two pickle files and then load them in a keyword extraction module.)
After I changed the spaCy matcher to include your code token.set_extension("lemma_lower", force=True, getter=lambda t: t.lemma_.lower()), finding keywords regardless of capitalization works fine. But when I load the pickled matcher and nlp in the keyword extractor module, I get the error: "Can't retrieve unregistered extension attribute 'lemma_lower'. Did you forget to call the set_extension method?"
Then, when I add the line token.set_extension("lemma_lower", force=True, getter=lambda t: t.lemma_.lower()) anywhere in the extractor module, this error comes up: "token undefined". Could you help me solve this error, either by saving differently or by adding the token.set_extension call somewhere? Thank you.

kinghuang Sep 1, 2022

> Then, when adding the line token.set_extension("lemma_lower", force=True, getter=lambda t: t.lemma_.lower()) anywhere in the extractor module, this error comes up: "token undefined".

The extension should be set on the Token class.

from spacy.tokens import Token

Token.set_extension("lemma_lower", getter=lambda t: t.lemma_.lower(), force=True)
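For the save-and-load workflow described above, one possible arrangement is sketched here, assuming spaCy v3. The helper name register_extensions, the label "KEYWORD", and the in-memory pickle bytes (standing in for the pickle file) are illustrative assumptions. The point is that the extension is registered on the Token class in every script, before the loaded matcher is used.

```python
import pickle

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Token


def register_extensions():
    """Hypothetical helper: call it in every script that builds or loads the matcher."""
    Token.set_extension("lemma_lower", getter=lambda t: t.lemma_.lower(), force=True)


# --- builder script ---
register_extensions()
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("KEYWORD", [[{"LOWER": "product"}, {"_": {"lemma_lower": "preference"}}]])
payload = pickle.dumps(matcher)  # stands in for writing the pickle file

# --- extractor script ---
register_extensions()  # register before the loaded matcher is applied
loaded_matcher = pickle.loads(payload)

# Hand-built doc on the loaded matcher's vocab, with an uppercase lemma.
doc = Doc(
    loaded_matcher.vocab,
    words=["PRODUCT", "PREFERENCES"],
    lemmas=["PRODUCT", "PREFERENCE"],
)
matches = [doc[start:end].text for _, start, end in loaded_matcher(doc)]
print(matches)
```

Keeping the registration in one importable helper avoids the two scripts drifting apart; the pickle files store the patterns but not the extension itself.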

kinghuang Sep 1, 2022

> After I changed the Spacy matcher to include your code token.set_extension("lemma_lower", force=True, getter=lambda t: t.lemma_.lower()), finding keywords with capitalization works fine. But when I load the pickle matcher and nlp in the keyword extractor module, it gives the error: "Can't retrieve unregistered extension attribute 'lemma_lower'. Did you forget to call the set_extension method?"

You should register the extension somewhere in your code before you try to use it. For example, if you're assembling a custom spaCy language model, you can write a callback function to include with the model and set it for before_creation so that it's automatically called when the language object is created.

from spacy.tokens import Token
from spacy.util import registry

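@registry.callbacks("register_extensions")
def create_register_extensions():
    # before_creation expects a callback that takes the Language class and returns it
    def register_extensions(lang_cls):
        Token.set_extension("lemma_lower", getter=lambda t: t.lemma_.lower(), force=True)
        return lang_cls
    return register_extensions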

[nlp]
before_creation = {"@callbacks":"register_extensions"}
