How did English lemmatization change from v2 to v3? #9812
If I run the attached code with spaCy 2.3.7, the word "married" is printed twice. If I run the same code with spaCy 3.0.7, the output is different. How did lemmatization change from v2 to v3 (and especially, how does lemmatization work in v2)? Also, can I reproduce v2 lemmatization results in v3?

```python
import spacy

nlp = spacy.load("en_core_web_lg")
text1 = "married"
text2 = "Who is married to Brad Pitt?"
doc1 = nlp(text1)
doc2 = nlp(text2)
print(doc1[0].lemma_)  # lemma of "married" in isolation
print(doc2[2].lemma_)  # lemma of "married" with sentence context
```
Replies: 1 comment
The output is probably different because the tagger is producing a different fine-grained tag (`token.tag_`) for this word. This is mapped to `token.pos_`, and then the lemma rules are chosen based on POS. The tag->pos mapping and the lemmatizer algorithm are nearly identical in v2.3.x and v3.0.x model versions. The tag->pos mapping was updated for v3.2.x model versions.

Each exact model version (v2.3.0, v2.3.1, v3.0.0) may produce slightly different output for the same example because of model config or training differences, and although the tagger algorithm stayed the same, all the details for the config and training settings changed quite a bit from v2 to v3.

In addition, examples with more context are more likely to get tagged correctly than individual words, since many words are ambiguous.
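To make the tag -> POS -> lemma-rule chain concrete, here is a toy sketch of POS-conditioned suffix-rule lemmatization. The rule tables below are illustrative inventions, not spaCy's actual lemmatizer tables, but they show why the same surface form can get different lemmas when the tagger assigns a different POS:

```python
# Toy POS-conditioned rule lemmatizer (illustrative rules only,
# NOT spaCy's real tables): lemma rules are selected by coarse POS,
# so a different tag can produce a different lemma for the same word.
RULES = {
    "VERB": [("ied", "y"), ("ing", ""), ("ed", "")],
    "ADJ": [("er", ""), ("est", "")],
}

def lemmatize(word: str, pos: str) -> str:
    for suffix, repl in RULES.get(pos, []):
        if word.endswith(suffix):
            # Strip the suffix and apply the replacement.
            return word[: -len(suffix)] + repl
    return word  # no rule matched: the word is its own lemma

print(lemmatize("married", "VERB"))  # -> "marry"
print(lemmatize("married", "ADJ"))   # -> "married" (no ADJ rule matches)
```

Tagged as a VERB, "married" matches the `("ied", "y")` rule and becomes "marry"; tagged as an ADJ, no rule applies and the lemma stays "married". This is the same mechanism by which a changed `token.tag_` between v2 and v3 models changes the lemma, even though the lemmatization algorithm itself is nearly identical.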