Embeddings seem to be calculated before the senter even though the transformer is after it in the pipeline #9440
-
I am confused about when the embeddings are calculated when using the transformer. [EDITED with minimal repro steps using `en_core_web_trf`]

```python
import spacy
from thinc.util import get_array_module

def norm(vector) -> float:
    xp = get_array_module(vector)
    total = (vector ** 2).sum()
    return xp.sqrt(total) if total != 0. else 0.0

nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'lemmatizer', 'ner', 'attribute_ruler'])
nlp.add_pipe('senter', first=True, source=spacy.load("en_core_web_sm"))

result1 = nlp('This is sentence 1. This is sentence 2')
result2 = nlp('This is sentence 2')
sent_r1 = [x for x in result1.sents][1]
sent_r2 = [x for x in result2.sents][0]

# Text match
assert sent_r1.text == sent_r2.text  # PASS

# Get vectors
tensor_ix1 = result1._.trf_data.align[sent_r1.start: sent_r1.end].data.flatten()
out_dim = result1._.trf_data.tensors[0].shape[-1]
tensor1 = result1._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix1]

tensor_ix2 = result2._.trf_data.align[sent_r2.start: sent_r2.end].data.flatten()
out_dim = result2._.trf_data.tensors[0].shape[-1]
tensor2 = result2._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix2]

# The vectors are very different: context influences the vector in the first
# example even though the transformer is after the senter.
print(norm(tensor1 - tensor2))  # 19.4... expected to be 0 or close to 0
```

**Previous message with longer pipeline**

I have the following pipeline, which was created by sourcing the senter from `en_core_web_sm` (full config below). Please note that `tensor2attr` is just a component that computes the word/span/doc vectors from the word-piece vectors (borrowed from https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_iii/04_embeddings_continued.html).
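As an aside, the `norm` helper above is just a Euclidean (L2) norm; a plain-NumPy equivalent without thinc's CPU/GPU array dispatch can be handy for quick sanity checks:

```python
import numpy as np

def norm_np(vector: np.ndarray) -> float:
    # Euclidean (L2) norm, mirroring the thinc-based helper above,
    # but CPU-only (no get_array_module dispatch).
    total = float((vector ** 2).sum())
    return float(np.sqrt(total)) if total != 0.0 else 0.0

print(norm_np(np.array([3.0, 4.0])))  # 5.0
```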
Please note that the transformer is after the senter, so I would expect the embeddings to be calculated using the senter output. When I check the embeddings returned for sentence 2 in the full doc, they differ from the embeddings when calling the pipeline with only "This is sentence 2".

```python
result1 = nlp('This is sentence 1. This is sentence 2')
result2 = nlp('This is sentence 2')
sent_res1 = [x for x in result1.sents]
sent_res2 = [x for x in result2.sents]

assert sent_res1[1].text == sent_res2[0].text  # passes
assert (sent_res1[1].vector == sent_res2[0].vector).all()  # fails
```

Note that `tensor2attr` can be removed and the issue exposed by grabbing the vectors directly from `trf_data`:

```python
out_dim = result1._.trf_data.tensors[0].shape[-1]

tensor_ix1 = result1._.trf_data.align[sent_res1[1].start: sent_res1[1].end].data.flatten()
tensor1 = result1._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix1]

tensor_ix2 = result2._.trf_data.align[sent_res2[0].start: sent_res2[0].end].data.flatten()
tensor2 = result2._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix2]

norm(tensor1 - tensor2)  # about 11.34 => far from 0
```

So it seems that the embeddings are computed before the senter and get affected by the surrounding context, even though the senter is first in the pipeline. Am I missing something?
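The `trf_data` indexing pattern used above (flatten the alignment indices, reshape the word-piece tensor to 2-D, then fancy-index the rows) can be illustrated with toy NumPy arrays; the shapes and index values here are made up for illustration, not real model output:

```python
import numpy as np

# Toy stand-in for trf_data.tensors[0]: (batch, n_wordpieces, out_dim)
wordpiece_tensor = np.arange(24, dtype=float).reshape(1, 6, 4)
out_dim = wordpiece_tensor.shape[-1]

# Flatten the batch so each row is one word-piece vector: (6, 4)
flat = wordpiece_tensor.reshape(-1, out_dim)

# Pretend align[start:end].data.flatten() mapped the sentence's tokens
# to word-piece rows 3, 4 and 5.
align_idx = np.array([3, 4, 5])
sentence_rows = flat[align_idx]

print(sentence_rows.shape)  # (3, 4)
```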
Replies: 2 comments
-
Well, I think I just realized my mistake. The transformer operates on the doc, not on the individual sentences of the doc, hence the second sentence's vectors in the first example are affected by the rest of the doc... Is there a way to tell the transformer to "ignore" parts of the doc when computing vectors? Or do I need two separate pipelines: one to cut the text into sentences, then send each sentence to a second pipeline, essentially turning each sentence into a full doc?
-
If you don't want the default overlapping strided spans, you need to use a different "span getter". There's one for sentences, but you need to be sure that `senter` is in `annotating_components` for the sentences to be set at the point when the transformer component runs. See: https://spacy.io/api/transformer#span_getters. If a sentence span is too long for the transformer model, it will be truncated.
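For reference, a minimal config sketch of that setup; the registry name and section layout are taken from the span-getter docs linked above, so double-check them against your installed spacy-transformers version:

```ini
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"

[training]
annotating_components = ["senter"]
```

With `sent_spans.v1`, each sentence is passed to the transformer as its own span, so a sentence's embeddings no longer depend on the surrounding doc text (at the cost of losing that cross-sentence context).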