What's the difference between a tensor and a vector in spaCy 3.0? #6907

koaning · 2021-02-03T10:52:13Z

I've been playing around with the API and it seems like the documents/tokens now also have a tensor attribute. It feels like they are related to each other but they seem to have incompatible shapes.

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("this is a bit of text")

doc.tensor.shape, doc.vector.shape
# ((6, 96), (300,))

It seems like the tensor has a representation for each token, but why is the dimension different (96 vs. 300)?

text = doc[-1]
text.tensor.shape, text.vector.shape
# ((96,), (300,))

Looking at the API doc it seems like a tensor is defined as a "Container for dense vector representations." while a vector is "A real-valued meaning representation. Defaults to an average of the token vectors.".

So just so I understand, what's the difference between these two? Am I correct to say that spaCy bundles two sets of embeddings in its models?

honnibal · 2021-02-03T11:51:50Z

honnibal
Feb 3, 2021
Maintainer

The explanations get a bit fuzzy here because we can define what the thing is "conceptually", but, also, pipelines are allowed to write data to these attributes, and they might choose to use them with different semantics from how we really expect.

We use the doc.tensor attribute to store the contextual token-to-vector encodings computed by the Tok2Vec component. These encodings might be used as features by other components, if they have a Tok2VecListener layer inside their model. The doc.tensor values may or may not be useful to you outside of those modelling decisions: these are learned parameters and all bets are off, really.

The token.vector attribute is usually drawn out of the static word vectors, if the pipeline has some loaded. You can use hooks to customize how that attribute is calculated though, so a pipeline could also be providing something else there. In our pipelines they'll be "word vectors" in the traditional sense though.

9 replies

koaning Feb 5, 2021
Author

@thiippal Supercool! Nice to hear you've found the library to be useful.

One thing about your document though. Are you referring to the tensors in the transformer?

example_doc._.trf_data.tensors

Because I am referring to the tensor that is attached even in non-transformer models.

import spacy 

nlp = spacy.load("en_core_web_md")
example_doc = nlp("i have context")
example_doc.tensor

thiippal Feb 5, 2021

What – I had no idea that a Doc has a tensor attribute as well – should have read the thread more carefully.

Actually good that this came up, since my students are probably going to ask about this as soon as they find the tensor attribute.

stas-sl Jan 27, 2022

Hmm, I'm still a bit confused why vector shape is 300 and tensor is (..., 96). Are they calculated one from another? I mean in the simplest case, nothing fancy, no transformers or any custom pipelines.

stas-sl Jan 28, 2022

Ahh, ok... things seem to be a bit (or a lot) more complicated than I initially thought. After reading https://spacy.io/api/architectures and https://explosion.ai/blog/deep-learning-formula-nlp, I realized that Tok2Vec uses static word embeddings (token.vector) together with other context information to calculate token embeddings, which it stores in doc.tensor. These lines from the documentation are especially relevant:

If static vectors are included, a learned linear layer is used to map the vectors to the specified width before concatenating it with the other embedding outputs.
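
To make the shape difference concrete, here is a minimal NumPy sketch of the projection step described in that quote. It is not spaCy's implementation: the weight matrix is random rather than learned, the widths 300 and 96 are simply the shapes reported earlier in this thread, and the name `project_static_vectors` is hypothetical. The other embedding outputs and the contextual encoding layers are left out.

```python
import numpy as np

STATIC_WIDTH = 300  # width of token.vector reported in this thread
TOK2VEC_WIDTH = 96  # width of each doc.tensor row reported in this thread


def project_static_vectors(static_vectors: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Map (n_tokens, 300) static vectors to (n_tokens, 96) with one linear layer."""
    return static_vectors @ weights


rng = np.random.default_rng(0)
n_tokens = 6  # "this is a bit of text"

# Stand-ins only: real static vectors come from the vocab, real weights are learned.
static_vectors = rng.normal(size=(n_tokens, STATIC_WIDTH)).astype("float32")
weights = rng.normal(size=(STATIC_WIDTH, TOK2VEC_WIDTH)).astype("float32")

projected = project_static_vectors(static_vectors, weights)
print(static_vectors.shape, projected.shape)  # (6, 300) (6, 96)
```

So the 96-wide rows are a learned function of the 300-wide vectors (plus other features), not a slice or a copy of them, which is why the two attributes cannot be compared directly.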

Though, I have another question now.

if not len(self):
    self._vector = xp.zeros((self.vocab.vectors_length,), dtype="f")
    return self._vector
elif self.vocab.vectors.data.size > 0:
    self._vector = sum(t.vector for t in self) / len(self)
    return self._vector
elif self.tensor.size > 0:
    self._vector = self.tensor.mean(axis=0)
    return self._vector

Reading through the source code, I see that token vectors, which are simple static word embeddings, are prioritized over the context-aware token embeddings when calculating the vector for the entire document. Wouldn't it be better to reverse this order by default, since context-aware embeddings should represent the meaning better?
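
The priority order in the quoted snippet can be reproduced outside spaCy. The sketch below is a standalone NumPy paraphrase of its three branches, not the library's code: `doc_vector` and its arguments are hypothetical names, and an empty `static_vectors` array stands in for a vocab with no vectors loaded. The final fallback to zeros is an assumption, since the quoted snippet does not show that case.

```python
import numpy as np


def doc_vector(n_tokens: int, static_vectors: np.ndarray,
               tensor: np.ndarray, vectors_length: int) -> np.ndarray:
    """Mirror the branch order of the quoted snippet for one document."""
    if n_tokens == 0:
        # Empty document: zeros of the static vector width.
        return np.zeros((vectors_length,), dtype="float32")
    if static_vectors.size > 0:
        # Static word vectors win whenever they are available.
        return static_vectors.mean(axis=0)
    if tensor.size > 0:
        # Contextual encodings are used only as a fallback.
        return tensor.mean(axis=0)
    # Assumption: this case is not shown in the quoted snippet.
    return np.zeros((vectors_length,), dtype="float32")


static = np.ones((6, 300), dtype="float32")       # stand-in for token.vector rows
tensor = np.full((6, 96), 2.0, dtype="float32")   # stand-in for doc.tensor
empty = np.zeros((0,), dtype="float32")           # "no static vectors loaded"

with_static = doc_vector(6, static, tensor, 300)
without_static = doc_vector(6, empty, tensor, 0)
print(with_static.shape, without_static.shape)  # (300,) (96,)
```

Swapping the two middle branches would make the contextual encodings win whenever they are present, which is the change being asked about here.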

BTW, both sites spacy and explosion look really cool, I mean the information there of course is very well structured, but I liked the visual appearance a lot!

polm Jan 28, 2022

You're right that it's weird - the fallback behavior in .vector is a design that we've come to regret. It is on our list of things to change - we want to make it more consistent - but for compatibility reasons we can't redefine how the function works without a major version update, so it'll stay as-is for now.
