Hi @Oscarjia,

The en_core_web_sm model does have a tok2vec component. The sm tok2vec is based on token features like NORM, PREFIX, SUFFIX, etc., whereas the md / lg models use those same features plus external static word vectors concatenated into the representation. To further clarify:

  • The "vec" in tok2vec refers to a per-token vector produced by the component, not the same vectors as the "static word vectors."
  • In all models (sm/md/lg), the tok2vec component produces context-sensitive tensors that are stored in Doc.tensor (see the quick check below).
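
Here's a minimal sketch of how to verify this, assuming en_core_web_sm is installed (tok2vec, Doc.tensor, and nlp.vocab.vectors are the standard spaCy v3 names):

```python
import spacy

# Load the small English pipeline; it ships with a tok2vec component
# even though it has no static word vectors.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', ...]

doc = nlp("spaCy is written in Cython.")

# tok2vec fills Doc.tensor with one context-sensitive row per token,
# in every model size (sm/md/lg).
print(doc.tensor.shape)  # (number of tokens, tok2vec width)

# The sm model has no static word vectors, so its vectors table is empty.
print(len(nlp.vocab.vectors))  # 0 for sm; non-zero for md / lg
```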

So to answer your question: the sm model does have a tok2vec component based on token features, which is why it can still do the downstream tasks (POS tagging, etc.) and why tok2vec shows up as an option in the config file.
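
If you want to see which token features the embedding layer uses and whether static vectors are mixed in, you can inspect the loaded pipeline's config. This is a sketch that assumes the default MultiHashEmbed embedding layer; the exact attrs, row counts, and widths can differ between model versions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The embed sub-layer of the tok2vec model lists the token attributes it
# hashes and whether static vectors are concatenated in.
embed_cfg = nlp.config["components"]["tok2vec"]["model"]["embed"]
print(embed_cfg["attrs"])                   # e.g. ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE']
print(embed_cfg["include_static_vectors"])  # False for sm, True for md / lg
```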
