English and German sm pipelines always return True for has_vector and is_oov Token properties #11170
## Issue in more details

The tokens' `has_vector` and `is_oov` properties in the `Doc` created by the "en_core_web_sm" and "de_core_news_sm" pipelines are always `True`, even for random texts. The vector returned by the `vector` property is never an all-zero vector, as would be expected for OOV tokens and as the "en_core_web_md" and "de_core_news_md" pipelines correctly return (for OOV tokens).

## How to reproduce the behaviour

English version:
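A minimal sketch of the English check, assuming spaCy v3 with `en_core_web_sm` installed (the nonsense word is an arbitrary placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("blarghfnord runs quickly")
for token in doc:
    # Both properties report True for every token, even the nonsense one
    print(token.text, token.has_vector, token.is_oov)
```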
German version:
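Likewise, a sketch of the German check under the same assumptions, with `de_core_news_sm`:

```python
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("blarghfnord läuft schnell")
for token in doc:
    # Same result: has_vector and is_oov are True for every token
    print(token.text, token.has_vector, token.is_oov)
```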
## Your Environment
Replies: 1 comment
This is a bit of a quirk, but it's currently the intended behavior for spaCy v2 and v3. The `Token` objects in a doc have a backoff behavior for vectors that provides the context-sensitive tensors as vectors if `doc.tensor` is set. `doc.tensor` is set by a `tok2vec` component in the pipeline.

If you only apply the tokenizer or use a blank model instead of `en_core_web_sm`, you can see what it looks like when `doc.tensor` is not set:

```python
assert nlp.make_doc("text")[0].has_vector is False
```

If you want to know whether there is a static word vector in `nlp.vocab.vectors`, you can use `token.is_oov` or you can check the lexeme rather than the token:

```python
# for an existing Token
nlp("word")[0].lex.has_vector
# look up the lexeme in the vocab
nlp.vocab["word"].has_vector
```

Another part of the confusing situation here is that `is_oov` only reflects the static vectors table, so in an sm pipeline a token can report `has_vector is True` (via the tensor backoff) and `is_oov is True` (no static vector) at the same time. This behavior is definitely confusing and we regret the decision to have the tensor backoff behavior at all. We don't want to change the API/behavior for v3, but improving this is on our to-do list for v4.
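To make the contrast concrete, here is a small sketch (not part of the original reply) of how the token-level and lexeme-level checks differ in an sm pipeline, assuming `en_core_web_sm` is installed and the nonsense word is an arbitrary placeholder:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
token = nlp("blarghfnord")[0]

# Tensor backoff: True because the tok2vec component set doc.tensor
print(token.has_vector)      # True
# is_oov reflects nlp.vocab.vectors, which is empty in sm pipelines
print(token.is_oov)          # True
# The lexeme bypasses the tensor backoff and reflects the static table
print(token.lex.has_vector)  # False
```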