Normalization of contractions: inconsistency between lemmatizer and norm #8625

phipsgabler · 2021-07-06T09:51:35Z


I am a bit concerned about the following apparent inconsistency between lemmata and normal forms:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> [(t.lemma_, t.norm_, t.tag_) for t in nlp("he's flying")]
[('he', 'he', 'PRP'), ('be', "'s", 'VBZ'), ('fly', 'flying', 'VBG')]
>>> [(t.lemma_, t.norm_, t.tag_) for t in nlp("I'll be flying")]
[('I', 'i', 'PRP'), ("'ll", 'will', 'MD'), ('be', 'be', 'VB'), ('fly', 'flying', 'VBG')]

Why is 'll lemmatized as 'll but normalized to its full form will, whereas 's behaves the other way round? I'd surely consider will the lemma of 'll. Is this a bug or done intentionally?

What I want to get, eventually, is the sequence of "expanded lemmata":

["he", "be", "fly"]
["I", "will", "be", "fly"]

which I had hoped would be possible without manual intervention, such as a custom pass or hand-written exceptions.

spaCy version: 3.0.6
Platform: Linux-5.8.0-59-generic-x86_64-with-glibc2.10
Python version: 3.8.10
Pipelines: en_core_web_sm (3.0.0)

(I asked this on StackOverflow first, but got no answer.)

Answered by adrianeboyd


adrianeboyd · 2021-07-07T07:01:42Z


The lemmas and the normalizations come from two separate sources that may or may not be in sync depending on the language defaults and pipeline configuration. There were some regressions in lemmas for contractions in the v3.0.0 pretrained pipelines vs. the v2.3.x pipelines. In the upcoming v3.1.0 models, lemmas for contractions in English will be improved to be more like the v2.3.x models.

If you want to modify the normalizations or lemmas provided by an existing pipeline, there's no good alternative to making manual changes in some form: modifying language defaults, lemmatization tables, or attribute ruler rules, or adding a custom component. In this case, my first recommendation would be to add or edit attribute ruler rules to produce the lemmas that you would prefer for contractions: https://spacy.io/usage/linguistic-features#mappings-exceptions
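For illustration, here is a minimal sketch of what such attribute ruler rules might look like. It is not taken from the thread: the `CONTRACTION_LEMMAS` mapping is a hypothetical choice of preferred lemmas, and a blank English pipeline is used only so the example runs without a model download. With a trained pipeline such as `en_core_web_sm`, the existing component would presumably be fetched with `nlp.get_pipe("attribute_ruler")` rather than added.

```python
import spacy

# Blank English pipeline: tokenizer only, so no model download is needed.
# Assumption: with a trained pipeline you would call
# nlp.get_pipe("attribute_ruler") instead of adding a new component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Hypothetical mapping from contraction tokens to the lemmas we prefer.
# "'s" is left out on purpose: it can mean "is", "has", or a possessive,
# so a rule keyed on the surface form alone cannot pick the right lemma.
CONTRACTION_LEMMAS = {
    "'ll": "will",
    "'ve": "have",
    "'re": "be",
    "n't": "not",
}

# One single-token pattern per contraction; matching tokens get the lemma.
for orth, lemma in CONTRACTION_LEMMAS.items():
    ruler.add(patterns=[[{"ORTH": orth}]], attrs={"LEMMA": lemma})

doc = nlp("I'll be flying")
print([(t.text, t.lemma_) for t in doc])
```

In the blank pipeline only the tokens matched by a rule receive a lemma, since there is no lemmatizer. In a trained pipeline the other lemmas would come from the lemmatizer as usual, and a rule for `'s` would likely need extra context in the pattern (for example a `TAG` constraint), which still may not separate "is" from "has".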

3 replies

phipsgabler Jul 7, 2021
Author

That answers the practical question, thank you. But, taken on its own, don't you consider 's being "normalized" to 's a bug?

(I have already seen the issue about the 'll lemmatisation.)

adrianeboyd Jul 7, 2021

No, the NORM needs to be something that's consistent for each ORTH form so it can be used as a model feature without any additional processing. The normalization is intended to group ORTH forms together that should be similar for the model predictions (currency symbols, abbreviations with full forms, American vs. British spellings, etc.) to deal with data sparsity issues, not for disambiguation of ambiguous ORTH forms.
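To make the point concrete, here is a toy model of a context-free normalization table. It is only a sketch of the idea described above, not spaCy's implementation: `NORM_TABLE`, its entries, and the lowercase fallback in `norm` are assumptions made for this example.

```python
# Toy model of a context-free normalization table. The entries are
# illustrative and are not copied from spaCy's data.
NORM_TABLE = {
    "'ll": "will",       # contraction with a single full form
    "colour": "color",   # British vs. American spelling
    "€": "$",            # currency symbols grouped together
    "£": "$",
}


def norm(orth: str) -> str:
    """Return one fixed normal form per surface form, ignoring context."""
    # Assumption: unknown forms fall back to their lowercase form.
    return NORM_TABLE.get(orth, orth.lower())


# "'s" stands for "is", "has", or a possessive marker in these three
# token sequences. A lookup keyed only on the surface form cannot tell
# them apart, so "'s" keeps the same norm everywhere.
examples = (
    ["he", "'s", "flying"],
    ["he", "'s", "flown"],
    ["Ann", "'s", "book"],
)
for tokens in examples:
    print([norm(tok) for tok in tokens])
```

Because the lookup sees only the surface form, expanding `'s` to a full form would require disambiguation, which is the job of the tagger and lemmatizer rather than of the norm.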

phipsgabler Jul 7, 2021
Author

Ah, I see. Then I have just misunderstood its purpose.
