How did English lemmatization change from v2 to v3? #9812
If I run the attached code with spaCy 2.3.7, the word "married" is printed twice. If I run the same code with spaCy 3.0.7, the output is different. How did lemmatization change from v2 to v3 (and especially, how does lemmatization work in v2)? Also, can I reproduce v2 lemmatization results in v3?

```python
import spacy

nlp = spacy.load("en_core_web_lg")
text1 = "married"
text2 = "Who is married to Brad Pitt?"
doc1 = nlp(text1)
doc2 = nlp(text2)
print(doc1[0].lemma_)  # lemma of "married" in isolation
print(doc2[2].lemma_)  # lemma of "married" with sentence context
```
Replies: 1 comment
The output is probably different because the tagger is producing a different fine-grained tag (`token.tag_`) for this word. This is mapped to `token.pos_`, and then the lemma rules are chosen based on POS. The tag->pos mapping and the lemmatizer algorithm are nearly identical in v2.3.x and v3.0.x model versions. The tag->pos mapping was updated for v3.2.x model versions.

Each exact model version (v2.3.0, v2.3.1, v3.0.0) may produce slightly different output for the same example because of model config or training differences, and although the tagger algorithm stayed the same, all the details for the config and training settings changed quite a bit from v2 to v3.

In addition, examples with more context are more likely to get tagged correctly than individual words, since many words are ambiguous.
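To make the tag -> POS -> lemma-rule chain concrete, here is a toy sketch of POS-conditioned suffix-rule lemmatization. The rule tables below are illustrative inventions, not spaCy's actual lemmatizer tables, but they show why the same surface form can get different lemmas when the tagger assigns a different POS:

```python
# Toy POS-conditioned rule lemmatizer (illustrative rules only,
# NOT spaCy's real tables): lemma rules are selected by coarse POS,
# so a different tag can produce a different lemma for the same word.
RULES = {
    "VERB": [("ied", "y"), ("ing", ""), ("ed", "")],
    "ADJ": [("er", ""), ("est", "")],
}

def lemmatize(word: str, pos: str) -> str:
    for suffix, repl in RULES.get(pos, []):
        if word.endswith(suffix):
            # Strip the suffix and apply the replacement.
            return word[: -len(suffix)] + repl
    return word  # no rule matched: the word is its own lemma

print(lemmatize("married", "VERB"))  # -> "marry"
print(lemmatize("married", "ADJ"))   # -> "married" (no ADJ rule matches)
```

Tagged as a VERB, "married" matches the `("ied", "y")` rule and becomes "marry"; tagged as an ADJ, no rule applies and the lemma stays "married". This is the same mechanism by which a changed `token.tag_` between v2 and v3 models changes the lemma, even though the lemmatization algorithm itself is nearly identical.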