English and German sm pipelines always return True for has_vector and is_oov Token properties #11170
## Issue in more details

The tokens' `has_vector` and `is_oov` properties in the `Doc` created by the "en_core_web_sm" and "de_core_news_sm" pipelines are always `True`, even for random texts. The vector returned by the `vector` property is never an all-zero vector, as would be expected for OOV tokens and as the "en_core_web_md" and "de_core_news_md" pipelines correctly return (for OOV tokens).

## How to reproduce the behaviour

English version:
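A minimal sketch of the English check, assuming spaCy v3 with `en_core_web_sm` installed (the nonsense word is an arbitrary placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("blarghfnord runs quickly")
for token in doc:
    # Both properties report True for every token, even the nonsense one
    print(token.text, token.has_vector, token.is_oov)
```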
German version:
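Likewise, a sketch of the German check under the same assumptions, with `de_core_news_sm`:

```python
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("blarghfnord läuft schnell")
for token in doc:
    # Same result: has_vector and is_oov are True for every token
    print(token.text, token.has_vector, token.is_oov)
```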
## Your Environment
Replies: 1 comment
This is a bit of a quirk, but it's currently the intended behavior for spaCy v2 and v3. The `Token` objects in a doc have a backoff behavior for vectors that provides the context-sensitive tensors as vectors if `doc.tensor` is set. `doc.tensor` is set by a `tok2vec` component in the pipeline.

If you only apply the tokenizer or use a blank model instead of `en_core_web_sm`, you can see what it looks like when `doc.tensor` is not set:

```python
assert nlp.make_doc("text")[0].has_vector is False
```

If you want to know whether there is a static word vector in `nlp.vocab.vectors`, you can use `token.is_oov` or you can check the lexeme rather than the token:

```python
# for an existing Token
nlp("word")[0].lex.has_vector
# look up the lexeme in the vocab
nlp.vocab["word"].has_vector
```

Another part of the confusing situation here is that `is_oov` only reflects the static vectors table, so in an sm pipeline a token can report `has_vector is True` (via the tensor backoff) and `is_oov is True` (no static vector) at the same time. This behavior is definitely confusing and we regret the decision to have the tensor backoff behavior at all. We don't want to change the API/behavior for v3, but improving this is on our to-do list for v4.
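To make the contrast concrete, here is a small sketch (not part of the original reply) of how the token-level and lexeme-level checks differ in an sm pipeline, assuming `en_core_web_sm` is installed and the nonsense word is an arbitrary placeholder:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
token = nlp("blarghfnord")[0]

# Tensor backoff: True because the tok2vec component set doc.tensor
print(token.has_vector)      # True
# is_oov reflects nlp.vocab.vectors, which is empty in sm pipelines
print(token.is_oov)          # True
# The lexeme bypasses the tensor backoff and reflects the static table
print(token.lex.has_vector)  # False
```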