How does the tok2vec component deal with OOV words? #7729
-
I'm really struggling to understand how spaCy deals with out-of-vocabulary (OOV) words. The small model is able to generate both vectors and tensors for words that would be considered out of vocabulary, and the medium and large models produce tensors even though the word is not in the vector vocabulary. I'll provide a few examples. For the small model:
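Something along these lines, with a made-up string standing in for an OOV token (a sketch; it assumes the `en_core_web_sm` pipeline is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("floobargle")  # made-up OOV token
token = doc[0]

print(token.vector[:5])         # a vector, even though the token is OOV
print(doc.tensor[token.i][:5])  # the tok2vec tensor row for this token
```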
This produces a non-zero vector and a non-zero tensor row for the OOV token.
For the medium model:
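The same check with the medium pipeline (again a sketch, assuming `en_core_web_md` is installed):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("floobargle")  # the same made-up OOV token
token = doc[0]

print(token.has_vector)         # False: the token is not in the vectors table
print(doc.tensor[token.i][:5])  # but a tok2vec tensor row still exists
```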
This produces a tensor for the OOV token, even though the word has no entry in the vector vocabulary.
The typical way to produce embeddings is to have a vocabulary as a dictionary whose keys are tokens and values are IDs; the ID is then used to index into a matrix that contains the embedding for each token. For a token that is not in the vocabulary, there is usually an ID set aside that all such words are mapped to, whose embedding could be all zeros or an average of all the embeddings in the vocabulary (see the sketch at the end of this post). That is clearly not what happens here, since trying another OOV token produces a different vector/tensor in the small model and a different tensor in the medium model.

I'm really interested in understanding the process of going from an out-of-vocabulary string to a vector/tensor in the small model, and then similarly how that process works in the medium and large models. From my understanding of the LMAO (language modelling with approximate outputs) training objective, there surely has to be a vocabulary of vectors, as that is what the language model is trying to predict during training.

Any help in clearing this up would be much appreciated! Thanks
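For reference, the conventional lookup scheme I have in mind is something like this (a generic toy sketch with a made-up vocabulary, not spaCy's implementation):

```python
import numpy as np

# Toy vocabulary: token -> ID, with ID 0 reserved for all unknown tokens
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
embeddings = np.random.rand(len(vocab), 4)  # one embedding row per ID
embeddings[0] = 0.0  # e.g. an all-zero row shared by every OOV token

def embed(token: str) -> np.ndarray:
    # Every OOV token maps to the same reserved ID, so they all share
    # one identical embedding; spaCy clearly does not do this.
    return embeddings[vocab.get(token, 0)]

print(embed("cat"))     # a learned row
print(embed("qwerty"))  # the shared <unk> row
print(embed("zxcvb"))   # identical to the previous line
```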
-
The behavior for `token.vector` in the Python API is confusing because it backs off to the `tok2vec` tensor if the model doesn't include any vectors. If the model does include vectors, `token.vector` returns a 0-vector for unknown tokens. (Overall, we think this backoff behavior was a mistake in the design of `token.vector`, but we've kept it since it's been that way for a long time and users may be relying on it.)
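You can check both cases from the Python API (a small sketch, assuming `en_core_web_md` is installed and using a made-up OOV string; note that the v3.0 bug mentioned at the end of this reply may affect the exact unknown-vector values):

```python
import spacy

nlp = spacy.load("en_core_web_md")
token = nlp("floobargle")[0]  # made-up OOV token

# With a vectors table present, an unknown token gets a 0-vector
print(token.has_vector)    # False
print(token.vector.any())  # False: all zeros

# The raw vectors table itself has no row for the token
key = nlp.vocab.strings["floobargle"]
print(key in nlp.vocab.vectors)  # False
```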
This is handled differently internally in the statistical models, which access the raw vectors table directly:
- `sm`: the `tok2vec` tensor is based on token features (`NORM`, `SHAPE`, etc.)
- `md`/`lg`: the same tensor as in `sm`, plus the word vector, combined with `concatenate`
You can see the definitions in `MultiHashEmbed` in `spacy/ml/models/tok2vec.py` (lines 161 to 181 at commit 27dbbb9).

There is currently a bug in how unknown vectors are handled in v3.0, since the unknown vector index (…)