Token vectors are empty using en_core_web_trf model #8061
-
This is a copy of #8037 / #8047 because for some reason the migration on that won't finish. This was originally posted by @erip.
I imagine the model is trained on subwords, so maybe the alignment between those and the tokens on the spaCy side is causing issues?
-
Sorry for the delayed reply on this, not sure what's up with the issue migration.

This is a design decision and not a bug. Basically the `.vector` API is only for static word vectors, not for contextual vectors like those generated by the transformer. The transformer models in spaCy don't include static word vectors because if you have transformers you usually don't need them. If you need per-token representations, what you can do instead is use the data in `doc._.trf_data`, which contains tensors, wordpieces, and an alignment between spaCy tokens and the wordpieces. (I'm not sure there's a guide to this anywhere yet.)
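To make the alignment idea concrete without requiring the `en_core_web_trf` download, here is a small sketch of the pooling step: each spaCy token maps to one or more wordpieces, and you can mean-pool the wordpiece vectors into a per-token vector. The toy `wordpiece_vectors` and `alignment` below are stand-ins for the tensors and token-to-wordpiece alignment you would pull out of `doc._.trf_data`; the exact shapes and attribute layout there depend on the spacy-transformers version, so treat this as the general recipe rather than the exact API.

```python
import numpy as np

def pool_token_vectors(wordpiece_vectors, alignment):
    """Mean-pool wordpiece vectors into per-token vectors.

    wordpiece_vectors: (n_wordpieces, dim) array of contextual vectors.
    alignment: list where alignment[i] holds the wordpiece indices for
    token i (mirroring the token-to-wordpiece alignment in trf_data).
    Tokens with no aligned wordpieces get a zero vector.
    """
    dim = wordpiece_vectors.shape[1]
    return np.array([
        wordpiece_vectors[idxs].mean(axis=0) if idxs else np.zeros(dim)
        for idxs in alignment
    ])

# Toy data: 4 wordpieces with 3-dim vectors, aligned to 2 tokens.
wp = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [2.0, 2.0, 2.0]])
align = [[0, 1], [2, 3]]  # token 0 -> wordpieces 0,1; token 1 -> 2,3

vecs = pool_token_vectors(wp, align)
print(vecs)
```

With a real pipeline the same loop would run over the alignment stored on `doc._.trf_data` instead of the toy `align` list; mean pooling is just one reasonable choice here (last-wordpiece or max pooling are common alternatives).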