Embeddings seem to be calculated before the senter even though the transformer is after it in the pipeline #9440
-
I am confused about when the embeddings are calculated when using the transformer. [EDITED with minimal repro steps using `en_core_web_trf`]

```python
import spacy
from thinc.util import get_array_module

def norm(vector) -> float:
    xp = get_array_module(vector)
    total = (vector ** 2).sum()
    return xp.sqrt(total) if total != 0. else 0.0

nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'lemmatizer', 'ner', 'attribute_ruler'])
nlp.add_pipe('senter', first=True, source=spacy.load("en_core_web_sm"))

result1 = nlp('This is sentence 1. This is sentence 2')
result2 = nlp('This is sentence 2')
sent_r1 = [x for x in result1.sents][1]
sent_r2 = [x for x in result2.sents][0]

# Text match
assert sent_r1.text == sent_r2.text  # PASS

# Get vectors
tensor_ix1 = result1._.trf_data.align[sent_r1.start: sent_r1.end].data.flatten()
out_dim = result1._.trf_data.tensors[0].shape[-1]
tensor1 = result1._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix1]

tensor_ix2 = result2._.trf_data.align[sent_r2.start: sent_r2.end].data.flatten()
out_dim = result2._.trf_data.tensors[0].shape[-1]
tensor2 = result2._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix2]

# The vectors are very different: context influences the vector in the first
# example even though the transformer is after the senter.
print(norm(tensor1 - tensor2))  # 19.4... expected to be 0 or close to 0
```

**Previous message with longer pipeline**

I have the following pipeline, which was created by sourcing the senter from `en_core_web_sm` (full config below). Please note that `tensor2attr` is just a component that computes the word/span/doc vectors from the word-piece vectors (borrowed from https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_iii/04_embeddings_continued.html).
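As an aside, the `norm` helper above is just a Euclidean (L2) norm; a plain-NumPy equivalent without thinc's CPU/GPU array dispatch can be handy for quick sanity checks:

```python
import numpy as np

def norm_np(vector: np.ndarray) -> float:
    # Euclidean (L2) norm, mirroring the thinc-based helper above,
    # but CPU-only (no get_array_module dispatch).
    total = float((vector ** 2).sum())
    return float(np.sqrt(total)) if total != 0.0 else 0.0

print(norm_np(np.array([3.0, 4.0])))  # 5.0
```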
Please note that the transformer is after the senter, so I would expect the embeddings to be calculated using the senter output. When I check the embeddings returned for sentence 2 in the full doc, they differ from the embeddings when calling the pipeline with only "This is sentence 2".

```python
result1 = nlp('This is sentence 1. This is sentence 2')
result2 = nlp('This is sentence 2')
sent_res1 = [x for x in result1.sents]
sent_res2 = [x for x in result2.sents]

assert sent_res1[1].text == sent_res2[0].text  # passes
assert (sent_res1[1].vector == sent_res2[0].vector).all()  # fails
```

Note that `tensor2attr` can be removed and the issue exposed by grabbing the vectors directly from `trf_data`:

```python
out_dim = result1._.trf_data.tensors[0].shape[-1]

tensor_ix1 = result1._.trf_data.align[sent_res1[1].start: sent_res1[1].end].data.flatten()
tensor1 = result1._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix1]

tensor_ix2 = result2._.trf_data.align[sent_res2[0].start: sent_res2[0].end].data.flatten()
tensor2 = result2._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix2]

norm(tensor1 - tensor2)  # about 11.34 => far from 0
```

So it seems that the embeddings are computed before the senter and get affected by the surrounding context, even though the senter is first in the pipeline. Am I missing something?
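The `trf_data` indexing pattern used above (flatten the alignment indices, reshape the word-piece tensor to 2-D, then fancy-index the rows) can be illustrated with toy NumPy arrays; the shapes and index values here are made up for illustration, not real model output:

```python
import numpy as np

# Toy stand-in for trf_data.tensors[0]: (batch, n_wordpieces, out_dim)
wordpiece_tensor = np.arange(24, dtype=float).reshape(1, 6, 4)
out_dim = wordpiece_tensor.shape[-1]

# Flatten the batch so each row is one word-piece vector: (6, 4)
flat = wordpiece_tensor.reshape(-1, out_dim)

# Pretend align[start:end].data.flatten() mapped the sentence's tokens
# to word-piece rows 3, 4 and 5.
align_idx = np.array([3, 4, 5])
sentence_rows = flat[align_idx]

print(sentence_rows.shape)  # (3, 4)
```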
Replies: 2 comments
-
Well, I think I just realized my mistake. The transformer operates on the doc, not on the individual sentences of the doc, hence the second sentence's vectors in the first example are affected by the rest of the doc... Is there a way to tell the transformer to "ignore" parts of the doc when computing vectors? Or do I need two separate pipelines: one to cut the text into sentences, then send each sentence to a second pipeline, essentially turning each sentence into a full doc?
-
If you don't want the default overlapping strided spans, you need to use a different "span getter". There's one for sentences, but you need to be sure that `senter` is in `annotating_components` for the sentences to be set at the point when the transformer component runs. See: https://spacy.io/api/transformer#span_getters. If a sentence span is too long for the transformer model, it will be truncated.
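For reference, a minimal config sketch of that setup; the registry name and section layout are taken from the span-getter docs linked above, so double-check them against your installed spacy-transformers version:

```ini
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"

[training]
annotating_components = ["senter"]
```

With `sent_spans.v1`, each sentence is passed to the transformer as its own span, so a sentence's embeddings no longer depend on the surrounding doc text (at the cost of losing that cross-sentence context).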