Incorrect lemma casing for English proper adjectives #9056

danmysak · 2021-08-25T12:15:38Z

How to reproduce the behaviour

spacy.load('en_core_web_trf')('He is an American citizen')[3].lemma_

returns 'american' (should have been 'American').

Your Environment

spaCy version: 3.1.1
Platform: macOS-11.5.2-x86_64-i386-64bit
Python version: 3.9.6
Pipelines: en_core_web_trf (3.1.0), en_core_web_sm (3.1.0), en_core_web_lg (3.1.0)

Answered by adrianeboyd


adrianeboyd · 2021-08-26T06:32:51Z

Hi, the lemmas depend on the POS (token.pos_), so the result depends on whether the token is tagged as ADJ or PROPN. I think it's currently the expected result that you'd get American for "He is an American" and american for "He is an American citizen". Also see #3052, and be aware that tagging errors can lead to unexpected lemmas in some cases.

The issue is that the relatively simple ADJ rules in the rule-based lemmatizer treat "happy" and "American" in the same way and lowercase both. If you really need the lemma "American" here, you can add exceptions to the adj entry of the lemma exceptions table, which is available here: nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
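
For illustration, here is a minimal sketch of what adding such an exception could look like. The helper name add_adj_exception is hypothetical, and the table shape ({"adj": {form: [lemma, ...]}}) and the lowercased lookup key are assumptions about how the rule-based lemmatizer reads the table, so check them against your spaCy version. It also has not been verified here that the exception is consulted for every adjective. The demo uses a plain dict standing in for the real table so that it runs without a pipeline.

```python
def add_adj_exception(lemma_exc, form, lemma):
    """Map the adjective `form` to `lemma` in a lemma_exc-shaped table.

    `lemma_exc` is any mapping shaped like {"adj": {"form": ["lemma", ...]}}.
    """
    if "adj" not in lemma_exc:
        lemma_exc["adj"] = {}
    adj_exceptions = lemma_exc["adj"]
    # Assumption: the lemmatizer lowercases the token text before the lookup,
    # so the key is stored in lowercase while the lemma keeps its casing.
    adj_exceptions[form.lower()] = [lemma]
    return lemma_exc


# Plain dict standing in for the real table (hypothetical sample content).
table = {"adj": {"better": ["good"]}}
add_adj_exception(table, "American", "American")
print(table["adj"]["american"])

# With a loaded pipeline the call would look like this (not run here):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
# add_adj_exception(lemma_exc, "American", "American")
```

If you try this on a real pipeline, it is probably safest to add the exceptions before processing any text, and to confirm the result by checking token.lemma_ on a short example.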

3 replies

danmysak Aug 26, 2021
Author

Thank you. I don’t think this is a tagging issue, however: the POS is correct. What is wrong is the lemmatizer’s output, and not just in a single case but for a large, generalizable class of cases. I realize that it might not be trivial to fix, but I don’t quite understand why it is not considered a bug at all. Or am I misunderstanding what a lemma is in spaCy?

adrianeboyd Aug 26, 2021

It's not exactly a bug in the spacy library itself because the lemmatizer code in spacy is doing what it's expected to do with the rules + tables provided to it from spacy-lookups-data.

I totally agree that "american" is not a good lemma, but given the constraints of the tag sets, I don't think there's going to be a good general fix with the rule-based lemmatizer. You can probably get 95% of the way there with lemmatizer exceptions, but alternatively accepting and working with the consistent "american" result might be easier, depending on your task, of course.

The training data we're using for English (OntoNotes) doesn't even include lemmas, so we don't even have a good internal evaluation for the lemmatizer performance. I can find both "american" and "American" as example lemmas in UD corpora (just to pick easily-accessible examples), although "american" looks like it comes from automatic annotation steps, so it's probably just reflecting the same problem we're seeing here.
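
On the option of accepting the consistent "american" result: one way to work with it is to compare or count lemmas case-insensitively downstream. The sketch below is only an illustration. The function name count_lemmas_caseless is hypothetical, and the input list is a hand-written stand-in for token.lemma_ values rather than real pipeline output.

```python
from collections import Counter


def count_lemmas_caseless(lemmas):
    """Count lemmas, treating 'american' and 'American' as the same key."""
    return Counter(lemma.casefold() for lemma in lemmas)


# Hand-written stand-in for lemmas collected from several documents.
counts = count_lemmas_caseless(["American", "american", "citizen"])
print(counts["american"])  # 2
```

Whether this is acceptable depends on the task: it merges the ADJ and PROPN variants, which also discards the casing information.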

danmysak Aug 26, 2021
Author

Thanks, Adriane. It’s clearer now.
It’s a pity, though, that one can’t rely on spaCy’s lemmatization to produce consistently clean results out of the box.
