Issue with uppercased words, when using Matcher LEMMA attribute #11051

jademlc · 2022-06-29T07:58:18Z

I'm currently trying to use spaCy's Matcher in a project and I am having issues with the LEMMA attribute.
My data is noisy and often contains words with uppercase letters in them. This trips up the lemmatizer, so the matcher doesn't match the things it should.

Example:
The following Matcher should match every word that has 'cat' as its lemma:

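import spacy
from spacy.matcher import Matcher

# Assumption: the post does not name the pipeline; any English model
# with a tagger and lemmatizer would do here.
nlp = spacy.load("en_core_web_sm")

# Match any single token whose lemma is 'cat'
to_match = [{'LEMMA': 'cat'}]

doc = nlp('I love my cats, I love my CATS, I love my CAts')

cplx_matcher = Matcher(nlp.vocab)
cplx_matcher.add("patterns", [to_match], greedy='LONGEST')

matches = cplx_matcher(doc, as_spans=True)
print(matches)
# [cats]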

As you can see, the matcher doesn't find all three 'cats', only the first one. To find the problem, I printed the lemmas of the doc and got this:

I love my cat , I love my CATS , I love my cats

The problem here is that the lemmatizer doesn't handle the uppercase letters in the words: 'CATS' is left unchanged, and although the last 'CAts' gets lowercased, its lemma is 'cats' and not 'cat' as intended.

I really need to use the LEMMA attribute of the Matcher, as I lose information when using the LOWER attribute. My last option is to lowercase my texts, which isn't the best solution in my opinion.

I understand that this issue is linked to the lemmatizer, but I wondered whether there is a solution to it?

Info about spaCy

spaCy version: 3.3.0
Platform: Linux-5.13.0-51-generic-x86_64-with-glibc2.29
Python version: 3.8.5

Answered by polm

polm · 2022-07-01T05:01:33Z

The principled way to do this, that will cause the least overall weirdness, is to train a truecasing model (a model that can tell you what the case of words should be) and use that to process text before passing it to spaCy.
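To make the truecasing idea concrete, here is a minimal sketch. It is not a trained model, only a frequency baseline that learns each word's most common casing from clean reference text and applies it to noisy text before it reaches spaCy. The function names `learn_casing` and `truecase` and the sample sentences are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def learn_casing(clean_sentences: Iterable[str]) -> Dict[str, str]:
    """Map each lowercased word to its most frequent surface form."""
    counts = defaultdict(Counter)
    for sentence in clean_sentences:
        # Skip the first word: a sentence-initial capital says little
        # about how the word is normally cased.
        for word in sentence.split()[1:]:
            counts[word.lower()][word] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(text: str, casing: Dict[str, str]) -> str:
    """Restore the learned casing; words never seen are left unchanged."""
    return " ".join(casing.get(word.lower(), word) for word in text.split())


# Hypothetical clean reference text standing in for a real corpus.
reference = [
    "We love our cats and dogs",
    "My friend visited Paris with two cats",
]
casing = learn_casing(reference)

print(truecase("I love my CATS and visited PARIS", casing))
# I love my cats and visited Paris
```

The sketch splits on whitespace, so punctuation stays attached to words; a real truecaser would work on proper tokens and use context rather than plain counts.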

I think there should be a way to do this in a less principled way, by changing the lemmatizer to treat all proper nouns as normal nouns and lowercasing them before lookup, but it would require a bit of work with the Lemmatizer implementation. Maybe look at rule_lemmatize and implement something similar in a subclass, say special_lemmatize. Then you can use your own class and pass mode = "special" via the config to use it.

It's unfortunate that this is kind of involved; it's a process we could document better.
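A rough sketch of the subclass approach described above, under some loud assumptions: the class name `CaseInsensitiveLemmatizer` is made up, the one-rule lookup table is a toy stand-in for the real English tables, and the POS tags are set by hand instead of coming from a tagger. It relies on the base `Lemmatizer` resolving `mode="special"` to a method named `special_lemmatize`, and it reuses `rule_lemmatize` by temporarily relabelling proper nouns as nouns rather than reimplementing the rules.

```python
from typing import List

from spacy.lang.en import English
from spacy.lookups import Lookups
from spacy.matcher import Matcher
from spacy.pipeline import Lemmatizer
from spacy.tokens import Doc, Token


class CaseInsensitiveLemmatizer(Lemmatizer):
    """Hypothetical subclass: lemmatize PROPN tokens as if they were NOUN."""

    @classmethod
    def get_lookups_config(cls, mode: str):
        # The custom mode needs the same tables as the built-in "rule" mode.
        return super().get_lookups_config("rule" if mode == "special" else mode)

    def special_lemmatize(self, token: Token) -> List[str]:
        if token.pos_ != "PROPN":
            return self.rule_lemmatize(token)
        # Temporarily relabel the token so the noun rules apply, then restore it.
        token.pos_ = "NOUN"
        try:
            return self.rule_lemmatize(token)
        finally:
            token.pos_ = "PROPN"


nlp = English()  # blank pipeline: no trained model is downloaded

# Toy table standing in for the real English rules (assumption for this demo).
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})

lemmatizer = CaseInsensitiveLemmatizer(nlp.vocab, None, mode="special")
lemmatizer.initialize(lookups=lookups)

# Hand-set POS tags; the assumption is that a tagger labels "CATS" as PROPN.
words = ["I", "love", "my", "cats", ",", "I", "love", "my", "CATS", ",",
         "I", "love", "my", "CAts"]
pos = ["PRON", "VERB", "PRON", "NOUN", "PUNCT", "PRON", "VERB", "PRON", "PROPN",
       "PUNCT", "PRON", "VERB", "PRON", "NOUN"]
doc = lemmatizer(Doc(nlp.vocab, words=words, pos=pos))

matcher = Matcher(nlp.vocab)
matcher.add("patterns", [[{"LEMMA": "cat"}]])
matches = matcher(doc, as_spans=True)
print([span.text for span in matches])
```

In a trained pipeline the class would still have to be registered as a component factory and swapped in for the stock lemmatizer, with mode = "special" set in the config; that wiring is left out here. Note also that treating every proper noun as a common noun over-lemmatizes real names: with the toy rule above, "Paris" tagged PROPN would lose its final "s".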

1 reply

jademlc Jul 13, 2022
Author

This worked perfectly for me, thank you for your help!
