Issue with uppercased words, when using Matcher LEMMA attribute #11051
I'm currently trying to use spaCy's Matcher in a project and I'm having issues with the LEMMA attribute. Example:
As you can see, the matcher doesn't find all three 'cats', only the first one. To find the problem, I printed the lemmas of the doc and found this:
The problem is that the lemmatizer doesn't work because of the uppercase letters in the word: even though it lowercases the last 'CAts', the lemma is 'cats' and not 'cat' as intended. I really need to use the LEMMA attribute of the Matcher, as I lose information using the LOWER attribute. My last option is to lowercase my texts, which isn't the best solution in my opinion. I understand that this issue is linked to the lemmatizer, but I wondered if there is a solution to it?
Replies: 1 comment 1 reply
The principled way to do this, the one that will cause the least overall weirdness, is to train a truecasing model (a model that can tell you what the case of words should be) and use it to process text before passing it to spaCy.

I think there should be a way to do this in a less principled way, by changing the lemmatizer to treat all proper nouns as normal nouns and lowercasing them before lookup, but it would require a bit of work with the Lemmatizer implementation. Maybe look at `rule_lemmatize` and implement something similar in a subclass, say `special_lemmatize`. Then you can use your own class and pass `mode = "special"` via the config to use it.

It's unfortunate that's kind of involved; it's a process we could document better.
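As a rough illustration of the "lowercase before lookup" idea only, here is a toy sketch in plain Python that does not use spaCy at all. The table `LEMMA_TABLE` and both functions are hypothetical stand-ins: a real `special_lemmatize` would be a method on a Lemmatizer subclass and would work with spaCy's own tables and rules, which this does not attempt to reproduce. The toy fallback (return the string unchanged on a miss) is also a simplification.

```python
# Hypothetical toy table standing in for a lemmatizer's lookup data.
# This is NOT spaCy's data or API; it only illustrates case sensitivity.
LEMMA_TABLE = {"cats": "cat", "dogs": "dog"}


def lookup_lemmatize(word: str) -> str:
    """Case-sensitive lookup: strings not in the table are returned unchanged."""
    return LEMMA_TABLE.get(word, word)


def special_lemmatize(word: str) -> str:
    """Lowercase the word first, then look it up (falling back to the lowercased word)."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


words = ["cats", "Cats", "CAts"]
case_sensitive = [lookup_lemmatize(w) for w in words]
case_insensitive = [special_lemmatize(w) for w in words]

print(case_sensitive)    # only the all-lowercase form hits the table
print(case_insensitive)  # every casing variant resolves to the same lemma
```

The original token text is left untouched here; only the key used for the lookup is lowercased, which is why this kind of approach would not lose the casing information that lowercasing the whole input text would.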