Normalization of contractions: inconsistency between lemmatizer and norm #8625
I am a bit concerned about the following apparent inconsistency between lemmata and normal forms:
Why is What I want to get, eventually, is the sequence of "expanded lemmata":
which I had hoped would be possible without manual intervention via a custom pass or writing exceptions.
(I asked this on StackOverflow first, but got no answer.)
Replies: 1 comment 3 replies
The lemmas and the normalizations come from two separate sources that may or may not be in sync, depending on the language defaults and pipeline configuration. There were some regressions in lemmas for contractions in the v3.0.0 pretrained pipelines compared with the v2.3.x pipelines. In the upcoming v3.1.0 models, lemmas for contractions in English will be improved to be more like the v2.3.x models.

If you want to modify the normalizations or lemmas provided by an existing pipeline, there's no good alternative to making manual changes in some form: modifying the language defaults, the lemmatization tables, or the attribute ruler rules, or adding a custom component. In this case, my first recommendation would be to add or edit attribute ruler rules to produce the lemmas that you would prefer for contractions: https://spacy.io/usage/linguistic-features#mappings-exceptions
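As a rough illustration of the attribute ruler approach, here is a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline so that no model download is needed; in a pretrained pipeline the component typically already exists and could be fetched with `nlp.get_pipe("attribute_ruler")` instead. The `CONTRACTION_LEMMAS` mapping and the `expanded_lemmas` helper are hypothetical names made up for this example, and the lemmas chosen are only one possible preference, not what any particular model version produces.

```python
import spacy

# Blank English pipeline: tokenizer only, so no pretrained model is required.
nlp = spacy.blank("en")

# Add an attribute ruler. In a pretrained pipeline you would more likely
# reuse the existing one via nlp.get_pipe("attribute_ruler").
ruler = nlp.add_pipe("attribute_ruler")

# Hypothetical mapping from contraction tokens to the lemmas we would prefer.
CONTRACTION_LEMMAS = {"'m": "be", "'re": "be", "'ve": "have", "n't": "not"}

# One rule per contraction: match the token by its lowercase form, set LEMMA.
for form, lemma in CONTRACTION_LEMMAS.items():
    ruler.add(patterns=[[{"LOWER": form}]], attrs={"LEMMA": lemma})


def expanded_lemmas(doc):
    """Return each token's lemma, falling back to its norm if no lemma is set."""
    return [token.lemma_ or token.norm_ for token in doc]


doc = nlp("I'm sure we don't know.")
for token in doc:
    # Compare the normalization and the lemma side by side.
    print(token.text, token.norm_, token.lemma_)
print(expanded_lemmas(doc))
```

The fallback to `token.norm_` is only there because a blank pipeline has no lemmatizer, so tokens without a rule have an empty lemma. With a pretrained pipeline, whether the rules above survive depends on where the attribute ruler sits relative to the lemmatizer and on the lemmatizer's settings, so it is worth checking the output on a few sentences.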