How does the tok2vec component deal with OOV words? #7729
-
I'm really struggling to understand how spaCy deals with out-of-vocabulary (OOV) words. The small model is able to generate both vectors and tensors for words that would be considered out of vocabulary, and the medium and large models produce tensors even though the word is not in the vector vocabulary. I'll provide a few examples. For the small model:
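Something along these lines, with a made-up string standing in for an OOV token (a sketch; it assumes the `en_core_web_sm` pipeline is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("floobargle")  # made-up OOV token
token = doc[0]

print(token.vector[:5])         # a vector, even though the token is OOV
print(doc.tensor[token.i][:5])  # the tok2vec tensor row for this token
```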
This produces a non-zero vector and a non-zero tensor row for the OOV token.
For the medium model:
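The same check with the medium pipeline (again a sketch, assuming `en_core_web_md` is installed):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("floobargle")  # the same made-up OOV token
token = doc[0]

print(token.has_vector)         # False: the token is not in the vectors table
print(doc.tensor[token.i][:5])  # but a tok2vec tensor row still exists
```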
This produces a tensor for the OOV token, even though the word has no entry in the vector vocabulary.
The typical way to produce embeddings is to have a vocabulary as a dictionary whose keys are tokens and values are IDs; the ID is then used to index into a matrix that contains the embedding for each token. For a token that is not in the vocabulary, there is usually an ID set aside that all such words are mapped to, whose embedding could be all zeros or an average of all the embeddings in the vocabulary (see the sketch at the end of this post). That is clearly not what happens here, since trying another OOV token produces a different vector/tensor in the small model and a different tensor in the medium model.

I'm really interested in understanding the process of going from an out-of-vocabulary string to a vector/tensor in the small model, and then similarly how that process works in the medium and large models. From my understanding of the LMAO (language modelling with approximate outputs) training objective, there surely has to be a vocabulary of vectors, as that is what the language model is trying to predict during training.

Any help in clearing this up would be much appreciated! Thanks
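For reference, the conventional lookup scheme I have in mind is something like this (a generic toy sketch with a made-up vocabulary, not spaCy's implementation):

```python
import numpy as np

# Toy vocabulary: token -> ID, with ID 0 reserved for all unknown tokens
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
embeddings = np.random.rand(len(vocab), 4)  # one embedding row per ID
embeddings[0] = 0.0  # e.g. an all-zero row shared by every OOV token

def embed(token: str) -> np.ndarray:
    # Every OOV token maps to the same reserved ID, so they all share
    # one identical embedding; spaCy clearly does not do this.
    return embeddings[vocab.get(token, 0)]

print(embed("cat"))     # a learned row
print(embed("qwerty"))  # the shared <unk> row
print(embed("zxcvb"))   # identical to the previous line
```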
-
The behavior for `token.vector` in the Python API is confusing because it backs off to the `tok2vec` tensor if the model doesn't include any vectors. If the model does include vectors, `token.vector` returns a 0-vector for unknown tokens. (Overall, we think this backoff behavior was a mistake in the design of `token.vector`, but we've kept it since it's been that way for a long time and users may be relying on it.)
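You can check both cases from the Python API (a small sketch, assuming `en_core_web_md` is installed and using a made-up OOV string; note that the v3.0 bug mentioned at the end of this reply may affect the exact unknown-vector values):

```python
import spacy

nlp = spacy.load("en_core_web_md")
token = nlp("floobargle")[0]  # made-up OOV token

# With a vectors table present, an unknown token gets a 0-vector
print(token.has_vector)    # False
print(token.vector.any())  # False: all zeros

# The raw vectors table itself has no row for the token
key = nlp.vocab.strings["floobargle"]
print(key in nlp.vocab.vectors)  # False
```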
This is handled differently internally in the statistical models, which access the raw vectors table directly:
- `sm`: the `tok2vec` tensor is based on token features (`NORM`, `SHAPE`, etc.)
- `md`/`lg`: the same tensor as in `sm`, plus the word vector, combined with `concatenate`
You can see the definitions in `MultiHashEmbed` in `spacy/ml/models/tok2vec.py` (lines 161 to 181 at commit 27dbbb9).

There is currently a bug in how unknown vectors are handled in v3.0, since the unknown vector index (…)