Hi @Oscarjia,

The en_core_web_sm model does have a tok2vec component. The sm tok2vec is based on token features like NORM, PREFIX, SUFFIX, etc., whereas the md / lg models use those same features plus external static word vectors concatenated into the representation. To further clarify:

  • The "vec" in tok2vec refers to a per-token vector produced by the component, not the same vectors as the "static word vectors."
  • In all models (sm/md/lg), the tok2vec component produces context-sensitive tensors that are stored in Doc.tensor (see the quick check below).
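
Here's a minimal sketch of how to verify this, assuming en_core_web_sm is installed (tok2vec, Doc.tensor, and nlp.vocab.vectors are the standard spaCy v3 names):

```python
import spacy

# Load the small English pipeline; it ships with a tok2vec component
# even though it has no static word vectors.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', ...]

doc = nlp("spaCy is written in Cython.")

# tok2vec fills Doc.tensor with one context-sensitive row per token,
# in every model size (sm/md/lg).
print(doc.tensor.shape)  # (number of tokens, tok2vec width)

# The sm model has no static word vectors, so its vectors table is empty.
print(len(nlp.vocab.vectors))  # 0 for sm; non-zero for md / lg
```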

So to answer your question: the sm model does have a tok2vec component based on token features, which is why it can still do the downstream tasks (POS tagging, etc.) and why tok2vec shows up as an option in the config file.
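
If you want to see which token features the embedding layer uses and whether static vectors are mixed in, you can inspect the loaded pipeline's config. This is a sketch that assumes the default MultiHashEmbed embedding layer; the exact attrs, row counts, and widths can differ between model versions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The embed sub-layer of the tok2vec model lists the token attributes it
# hashes and whether static vectors are concatenated in.
embed_cfg = nlp.config["components"]["tok2vec"]["model"]["embed"]
print(embed_cfg["attrs"])                   # e.g. ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE']
print(embed_cfg["include_static_vectors"])  # False for sm, True for md / lg
```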
