Lemmatizer lookups are case-sensitive #9235
The lookup done by the standard lemmatizers seems to be very sensitive to both natural and unnatural changes in case. This makes the lemmas produced by the pipeline less trustworthy as a preprocessing step, and it seems like something that shouldn't happen. If the lookups should in general be case-sensitive, it might make sense to have a fallback lookup with the lowercased form.

How to reproduce the behaviour

This can be replicated using the standard lemmatizers supplied with the standard models:

```python
import spacy
nlp_en = spacy.load("en_core_web_sm")
assert [w.lemma_ for w in nlp_en("conflating case")] == ['conflate', 'case']
assert [w.lemma_ for w in nlp_en("Conflating case")] == ['conflating', 'case']
assert [w.lemma_ for w in nlp_en("ConflaTing case")] == ['ConflaTing', 'case']
```

Also observed in the Danish pipeline.
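For reference, a similar check can be run against the Danish pipeline (a sketch assuming `da_core_news_sm` is installed; the example phrase is mine and the output is not asserted):

```python
import spacy

nlp_da = spacy.load("da_core_news_sm")
# Compare the lemmas for the same phrase with and without capitalization.
for text in ("løber hurtigt", "Løber hurtigt"):
    print([w.lemma_ for w in nlp_da(text)])
```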
Replies: 1 comment
The type of lemmatizer varies across languages, so check `nlp.get_pipe("lemmatizer").mode` to see for sure for a particular pipeline. Some are rule-based, some are lookup or POS-based lookup lemmatizers, and some languages have their own customizations for what a mode like `rule` does.

The `en_core` pipelines include the default English rule-based lemmatizer, and the rule-based lemmatizers depend on `token.pos`, so typically what's happening in cases like this is that the tagger has made an error between `NOUN`/`PROPN`, `NOUN`/`VERB`, or `ADJ`/`VERB`, so different rules are applied. Very short phrases like these are more likely to be tagged incorrectly than words with more context. In your example, look also at the `token.pos_` values for each variant, as in the snippet below.
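For example (a quick inspection sketch; the exact values printed will depend on the model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# The en_core pipelines ship the rule-based lemmatizer.
print(nlp.get_pipe("lemmatizer").mode)

# Show the POS tag assigned to each token next to its lemma.
for text in ("conflating case", "Conflating case", "ConflaTing case"):
    print([(t.text, t.pos_, t.lemma_) for t in nlp(text)])
```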
You can switch to a lookup lemmatizer for English if you'd like more consistent results for tokens with no or little context, or if consistent lemmas are more important than accurate lemmas. It's possible that that would be a better preprocessing step for your task. If you install the package `spacy-lookups-data`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
```

If you want different lowercasing behavior than with the current rule-based lemmatizer, then you'd need to create a custom lemmatizer. Look at the examples in the docs; a rough sketch of one possible approach follows.
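As a minimal sketch (not an official recipe: the component name `lowercase_fallback` is made up, and it assumes the lookup-mode setup above, whose English data from `spacy-lookups-data` provides a `lemma_lookup` table):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()

# Grab the lookup table so the fallback can query it directly.
lemma_table = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_lookup")

@Language.component("lowercase_fallback")  # hypothetical component name
def lowercase_fallback(doc: Doc) -> Doc:
    for token in doc:
        # If the lookup left a non-lowercase token unchanged, retry with
        # the lowercased form; fall back to the lowercased text itself
        # when there is still no table entry.
        if token.lemma_ == token.text and not token.is_lower:
            lower = token.text.lower()
            token.lemma_ = lemma_table.get(lower, lower)
    return doc

nlp.add_pipe("lowercase_fallback", after="lemmatizer")
print([t.lemma_ for t in nlp("ConflaTing case")])
```

Since the table keys are plain strings, this only helps for forms the table actually contains; anything else is simply lowercased.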