This is admittedly kind of confusing; the way the .vector method works could use some more detail in the docs.

Vectors can come from three different places, which are checked in this order:

  1. User hooks
  2. (if no vectors) Doc.tensor (if available)
  3. A vector lookup table

You can see this in the source for the method, which is pretty succinct.
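
For reference, here's a minimal sketch of the first source in that list: a registered user hook wins over both Doc.tensor and the vector table. The blank pipeline and the 300-dimensional arrays are arbitrary choices for illustration.

```python
import numpy
import spacy

nlp = spacy.blank("en")
doc = nlp("hello world")

# Doc-level and token-level hooks are checked before Doc.tensor or the vector table.
doc.user_hooks["vector"] = lambda doc: numpy.ones((300,), dtype="float32")
doc.user_token_hooks["vector"] = lambda token: numpy.zeros((300,), dtype="float32")

print(doc.vector[:3])     # from the doc-level hook -> [1. 1. 1.]
print(doc[0].vector[:3])  # from the token-level hook -> [0. 0. 0.]
```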

What's happening is that in the small model there is no vector table, so the vector representation comes from Doc.tensor, which is set by the tok2vec component. This uses a CNN with a small window, so neighboring tokens can affect the representation of an individual token. If you make a long sentence and just change the early words, you can see the later words are unaffected.
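
A quick way to see that fallback, assuming en_core_web_sm is installed (it ships without a static vector table, so Token.vector comes from Doc.tensor), is to compare the same token position across two documents that differ only at the start:

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_sm")

# Two sentences that differ only in their first three tokens.
doc_a = nlp("The cat sat quietly on the old wooden bench near the river")
doc_b = nlp("A dog ran quietly on the old wooden bench near the river")

# Tokens near the edit fall inside the CNN's receptive field and change;
# tokens far from it keep the same context-sensitive vector. The exact
# cutoff depends on the pipeline's tok2vec window and depth settings.
for i in (3, 11):  # "quietly" (close to the change) and "river" (far away)
    same = numpy.allclose(doc_a[i].vector, doc_b[i].vector)
    print(f"{doc_a[i].text!r} identical across the two docs: {same}")
```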
