Different (and incorrect) similarity scores across different spaCy versions #10906
How to reproduce the behaviour

Try running the following piece of code for spaCy 2.2.4 and spaCy 3.3.0:

import spacy
spacy_model = spacy.load("en_core_web_md")
above = spacy_model("above")
front = spacy_model("front")
above.similarity(front)

You will get different numbers. Notice that "above" and "front" are different words, while for 3.3.0,

Your Environment
Replies: 1 comment
If you use the en_core_web_lg vectors, I think you should get the same (or extremely similar) results for spacy v2.2-v3.3.

In the en_core_web_md models, the vectors are pruned so that multiple words get clustered together with the same vector. The pruning step isn't deterministic, so each version of en_core_web_md may have slightly different clusters and vectors.

From your results, it looks like "above" and "front" ended up in the same cluster in v3.3.0 but not in v2.2.5 (model versions that you can see with pip freeze or spacy validate, not the spacy version). There were also some minor changes related to the vector deduplication in v3.3.0 that affected the English vectors in particular (#10551…

In general, there's no guarantee from our side that different versions of an English pipeline will have the exact same vectors (or the exact same results for other components like the tagger), so if you need the exact results from one particular version, be sure to specify the exact model version in your project requirements along with the exact spacy version.
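
To make the clustering effect concrete, here is a minimal toy sketch in plain Python of how pruning can push the similarity of two different words to 1.0. It is not spaCy's implementation: the three-dimensional vectors, the choice of which rows to keep, and the helper names (cosine, prune, similarity) are all made up for illustration, and unlike the real pruning step this toy version is deterministic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical unpruned table: every word has its own vector.
full_vectors = {
    "above": [0.9, 0.2, 0.1],
    "front": [0.7, 0.5, 0.2],
    "banana": [0.1, 0.1, 0.9],
}


def prune(vectors, keep):
    """Keep rows only for the words in `keep`; remap every other word
    to the kept row that is most similar to its original vector."""
    rows = [vectors[word] for word in keep]
    key2row = {}
    for word, vec in vectors.items():
        if word in keep:
            key2row[word] = keep.index(word)
        else:
            key2row[word] = max(
                range(len(rows)), key=lambda i: cosine(vec, rows[i])
            )
    return rows, key2row


def similarity(rows, key2row, w1, w2):
    """Similarity as seen through the pruned table."""
    return cosine(rows[key2row[w1]], rows[key2row[w2]])


# Assumption for the demo: only "above" and "banana" keep their own rows.
rows, key2row = prune(full_vectors, keep=["above", "banana"])

before = cosine(full_vectors["above"], full_vectors["front"])
after = similarity(rows, key2row, "above", "front")
print(f"unpruned: {before:.3f}, pruned: {after:.3f}")
```

In this toy table "front" loses its own row and is remapped to the row for "above", so the two words are compared using the same vector and the pruned similarity comes out at (numerically) 1.0, even though the unpruned vectors differ. If a different set of rows were kept, "front" could land in another cluster, which is the kind of version-to-version difference described above.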