TRF model Doc vector on one word sentence totally different from the word vector itself #9536
Answered by adrianeboyd

oliviercwa asked this question in Help: Other Questions
How to reproduce the behaviour

It might not be a bug, but I find the result very surprising:

```python
import spacy
import numpy as np
from thinc.util import get_array_module

def norm(vector) -> float:
    xp = get_array_module(vector)
    total = (vector**2).sum()
    return xp.sqrt(total) if total != 0. else 0.

# Define a one-word sentence
nlp_trf = spacy.load('en_core_web_trf')
doc = nlp_trf('VESSEL')

# Get doc vector
doc_vect = doc._.trf_data.tensors[-1].mean(axis=0)

# Get span vector for the full doc
span = doc[:]
tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()
out_dim = span.doc._.trf_data.tensors[0].shape[-1]
tensor = span.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]
span_vect = tensor.mean(axis=0)

# Print similarity. Expected to be close to 1; instead the two are totally dissimilar
print(np.dot(doc_vect, span_vect) / (norm(doc_vect) * norm(span_vect)))
# -0.013732557
```
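The value printed above is plain cosine similarity; the `get_array_module` helper only exists so the same `norm` works for both NumPy and CuPy arrays. A quick self-contained check of the formula (NumPy only, with made-up vectors):

```python
import numpy as np

def norm(vector) -> float:
    # Same formula as the helper above, NumPy-only for illustration
    total = (vector ** 2).sum()
    return float(np.sqrt(total)) if total != 0. else 0.

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([2.0, 0.0])

# Orthogonal vectors score 0, parallel vectors score 1
print(np.dot(a, b) / (norm(a) * norm(b)))  # 0.0
print(np.dot(a, c) / (norm(a) * norm(c)))  # 1.0
```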
Answered by adrianeboyd on Oct 25, 2021

Replies: 1 comment, 1 reply
Hi, the difference is whether you're including the special tokens or not. If you treat the doc the same way as the span, you get the same results:

```python
doc_vect = doc._.trf_data.tensors[-1].mean(axis=0)
tensor_ix = doc._.trf_data.align[0: len(doc)].data.flatten()
out_dim = doc._.trf_data.tensors[0].shape[-1]
tensor = doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]
doc_vect = tensor.mean(axis=0)
```
Answer selected by oliviercwa
There are five tokens on the transformer side (`['<s>', 'V', 'ESS', 'EL', '</s>']`) and the alignment to "VESSEL" in `trf_data.align` does not include the `<s>` and `</s>` tokens.
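The reshape-and-index pattern used in both snippets above can be sketched with plain NumPy: flatten the (batch, seq, dim) wordpiece tensor to (seq, dim), then select only the rows the alignment points at, which skips row 0 (`<s>`) and the last row (`</s>`). The shapes and alignment indices below are illustrative stand-ins, not values read from the model:

```python
import numpy as np

# Stand-in for trf_data.tensors[0]: 1 sequence of 5 wordpieces
# (['<s>', 'V', 'ESS', 'EL', '</s>']) with an illustrative hidden size of 4
tensors0 = np.arange(1 * 5 * 4, dtype=float).reshape(1, 5, 4)

# Stand-in for the flattened alignment of the token "VESSEL":
# it covers wordpieces 1..3 and excludes the special tokens at 0 and 4
tensor_ix = np.array([1, 2, 3])

out_dim = tensors0.shape[-1]
tensor = tensors0.reshape(-1, out_dim)[tensor_ix]
print(tensor.shape)  # (3, 4): only the three aligned wordpiece rows
```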