The behavior of token.vector in the Python API is confusing because it backs off to the tok2vec tensor if the model doesn't include any vectors. If the model does include vectors, token.vector returns a 0-vector for unknown tokens. (Overall, we think this backoff behavior was a design mistake in token.vector, but we've kept it because it's been that way for a long time and users may be relying on it.)
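A minimal sketch of that backoff logic, with hypothetical names (`token_vector`, `vectors_table`, `tok2vec_row` are illustrative, not spaCy's actual internals):

```python
import numpy as np

def token_vector(norm, vectors_table, tok2vec_row, dim=4):
    """Hypothetical sketch: if the model ships no static vectors,
    back off to the tok2vec tensor; if it has a vectors table but
    the token is unknown, return a 0-vector."""
    if vectors_table is None:
        # no vectors in the model -> back off to the tok2vec output
        return tok2vec_row
    # model has vectors: unknown tokens get a zero vector
    return vectors_table.get(norm, np.zeros(dim))

t2v = np.ones(4)
vecs = {"cat": np.full(4, 0.5)}

print(token_vector("dog", None, t2v))   # no vectors table: tok2vec row
print(token_vector("dog", vecs, t2v))   # unknown token: zeros
print(token_vector("cat", vecs, t2v))   # known token: its static vector
```

This is why checking `token.has_vector` (or whether the pipeline has a vectors table at all) before trusting `token.vector` is the safer pattern.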

This is handled differently internally in the statistical models, which access the raw vectors table directly:

  • sm: tok2vec tensor is based on token features (NORM, SHAPE, etc.)
  • md / lg: the same tensor as in sm, concatenated with the static word vector

You can see the definitions in MultiHashEmbed.
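The sm vs. md/lg difference above can be sketched as follows (illustrative names and dimensions; not the actual MultiHashEmbed code, which builds this with thinc layers):

```python
import numpy as np

def embed(feature_embedding, static_vector=None):
    """Hypothetical sketch: sm models use only the embedding built
    from token features (NORM, SHAPE, etc.); md/lg models concatenate
    the static word vector onto that same feature embedding."""
    if static_vector is None:
        # sm: feature-based embedding only
        return feature_embedding
    # md / lg: feature embedding plus static vector
    return np.concatenate([feature_embedding, static_vector])

feat = np.zeros(96)   # e.g. hashed NORM/SHAPE feature embedding
vec = np.ones(300)    # e.g. static word vector

print(embed(feat).shape)       # (96,)  -> sm-style
print(embed(feat, vec).shape)  # (396,) -> md/lg-style
```

The key point is that the statistical components never go through token.vector; they read the raw vectors table directly inside this embedding layer.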

Answer selected by polm