Related but unique words have the same embedding vector representation #10986
-
How to reproduce the behaviour
These related words have the exact same vector representation, which they should not. I assume this is a bug.
Your Environment
|
Beta Was this translation helpful? Give feedback.
Replies: 2 comments
-
That's a feature, not a bug 😉. spaCy allows you to prune the vectors in a vocabulary to help keep models lightweight. You can also confirm from inspecting the model card that the medium English model comes with 685K keys in the vocabulary but only 20K unique vectors. If you were to check the |
Beta Was this translation helpful? Give feedback.
-
oh, I was running tests on the smaller model locally! Thanks for the correction @koaning! |
Beta Was this translation helpful? Give feedback.
That's a feature, not a bug 😉.
spaCy allows you to prune the vectors in a vocabulary to help keep models lightweight. You can also confirm from inspecting the model card that the medium English model comes with 685K keys in the vocabulary but only 20K unique vectors.
If you were to check the
en_core_web_lg
model then these vectors should be different.