Lemmatizer lookups are case-sensitive #9235
The lookup done by the standard lemmatizers seems to be very sensitive to both natural and unnatural changes in case. This makes the lemmas produced by the pipeline less trustworthy as a preprocessing step, and it seems like something that shouldn't happen. If the lookups should in general be case-sensitive, it might make sense to have a fallback lookup with the lowercased form.

How to reproduce the behaviour

This can be replicated using the standard lemmatizers supplied with the standard models:

```python
import spacy
nlp_en = spacy.load("en_core_web_sm")
assert [w.lemma_ for w in nlp_en("conflating case")] == ['conflate', 'case']
assert [w.lemma_ for w in nlp_en("Conflating case")] == ['conflating', 'case']
assert [w.lemma_ for w in nlp_en("ConflaTing case")] == ['ConflaTing', 'case']
```

Also observed in the Danish pipeline.
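For reference, a similar check can be run against the Danish pipeline (a sketch assuming `da_core_news_sm` is installed; the example phrase is mine and the output is not asserted):

```python
import spacy

nlp_da = spacy.load("da_core_news_sm")
# Compare the lemmas for the same phrase with and without capitalization.
for text in ("løber hurtigt", "Løber hurtigt"):
    print([w.lemma_ for w in nlp_da(text)])
```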
Replies: 1 comment
The type of lemmatizer varies across languages, so check `nlp.get_pipe("lemmatizer").mode` to see for sure for a particular pipeline. Some are rule-based, some are lookup or POS-based lookup lemmatizers, and some languages have their own customizations for what a mode like `rule` does.

The `en_core` pipelines include the default English rule-based lemmatizer, and the rule-based lemmatizers depend on `token.pos`, so typically what's happening in cases like this is that the tagger has made an error between `NOUN`/`PROPN`, `NOUN`/`VERB`, or `ADJ`/`VERB`, so different rules are applied. Very short phrases like these are more likely to be tagged incorrectly than words with more context. In your example, look also at the `token.pos_` values for each variant, as in the snippet below.
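For example (a quick inspection sketch; the exact values printed will depend on the model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# The en_core pipelines ship the rule-based lemmatizer.
print(nlp.get_pipe("lemmatizer").mode)

# Show the POS tag assigned to each token next to its lemma.
for text in ("conflating case", "Conflating case", "ConflaTing case"):
    print([(t.text, t.pos_, t.lemma_) for t in nlp(text)])
```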
You can switch to a lookup lemmatizer for English if you'd like more consistent results for tokens with no or little context, or if consistent lemmas are more important than accurate lemmas. It's possible that that would be a better preprocessing step for your task. If you install the package `spacy-lookups-data`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
```

If you want different lowercasing behavior than with the current rule-based lemmatizer, then you'd need to create a custom lemmatizer. Look at the examples in the docs; a rough sketch of one possible approach follows.
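As a minimal sketch (not an official recipe: the component name `lowercase_fallback` is made up, and it assumes the lookup-mode setup above, whose English data from `spacy-lookups-data` provides a `lemma_lookup` table):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()

# Grab the lookup table so the fallback can query it directly.
lemma_table = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_lookup")

@Language.component("lowercase_fallback")  # hypothetical component name
def lowercase_fallback(doc: Doc) -> Doc:
    for token in doc:
        # If the lookup left a non-lowercase token unchanged, retry with
        # the lowercased form; fall back to the lowercased text itself
        # when there is still no table entry.
        if token.lemma_ == token.text and not token.is_lower:
            lower = token.text.lower()
            token.lemma_ = lemma_table.get(lower, lower)
    return doc

nlp.add_pipe("lowercase_fallback", after="lemmatizer")
print([t.lemma_ for t in nlp("ConflaTing case")])
```

Since the table keys are plain strings, this only helps for forms the table actually contains; anything else is simply lowercased.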