Best way to get contextual vector embeddings of tokens in a document #10361
-
This is what I'm trying to do, and I'd appreciate any thoughts on the best practice for achieving it. I have a collection of documents, and for every document I need to compute some metrics. For some of these metrics I need the vector representations/embeddings of the tokens (ideally word embeddings from Transformers, since I want contextual vector embeddings). I may also need to filter out tokens based on their part-of-speech (POS) tags. So I'm trying to figure out the best way to get both the POS tags and the vector representations of the tokens. Here is basically the input and output:

Input:

One point to consider: depending on which tokenizer is used, a word may be split into sub-tokens, each with its own embedding vector. Ideally, I don't want to deal with pooling the vectors; I want one single embedding vector for each POS-tagged token in the document. Currently, I've tried the following code:
However, there are two issues here: 1) as far as I understand. Also, I noticed that for each.

UPDATE: I just came across this useful discussion and realized that we have a
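Whichever library produces the wordpiece vectors, getting one vector per spaCy-style token comes down to an alignment plus a pooling step (mean-pooling is a common default); even when a library hides this, something like it happens under the hood. A minimal sketch of that step with made-up vectors, assuming you already have the wordpiece embeddings and a token-to-wordpiece alignment (e.g. from a Hugging Face fast tokenizer's `word_ids()`):

```python
import numpy as np

def pool_token_vectors(wordpiece_vectors, alignment):
    """Mean-pool wordpiece vectors into one vector per token.

    wordpiece_vectors: (n_wordpieces, hidden) array of subtoken embeddings.
    alignment: alignment[i] is the list of wordpiece indices that make up
    token i (e.g. derived from a fast tokenizer's word_ids()).
    """
    return np.vstack([wordpiece_vectors[idxs].mean(axis=0) for idxs in alignment])

# Toy data: 4 wordpieces with hidden size 3.
# Token 0 maps to wordpiece 0; token 1 was split into wordpieces 1-3.
wp = np.array([[1.0, 0.0, 0.0],
               [0.0, 2.0, 0.0],
               [0.0, 4.0, 0.0],
               [0.0, 0.0, 6.0]])
align = [[0], [1, 2, 3]]
tok_vecs = pool_token_vectors(wp, align)
print(tok_vecs.shape)  # (2, 3): one vector per token
print(tok_vecs[1])     # [0. 2. 2.]: mean of wordpieces 1-3
```

The same idea works with max-pooling or taking only the first subtoken's vector; which choice is best depends on the downstream metric.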
-
This is a good overview and includes a section on embeddings from trf models: https://applied-language-technology.mooc.fi/html/notebooks/part_iii/05_embeddings_continued.html
-
This looks good. Can it be added to spaCy's standard pipeline as one more component option? Also, I ran a few tests to check the similarity:
The similarity score is 0.78. I also wanted to see how similar each individual token is to the whole doc (for potential use in phrase extraction); those scores are extremely low:
Are these scores normal?
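Low token-to-doc similarities aren't necessarily a bug: spaCy's `similarity` is cosine similarity between vectors, and the contextual vector of a single token can point in a quite different direction from a vector averaged over the whole document. A minimal sketch with toy vectors (the numbers are made up purely for illustration):

```python
import numpy as np

def cosine(a, b):
    # spaCy's Doc.similarity / Token.similarity boils down to cosine similarity
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Three toy token vectors; the "doc vector" is their mean.
tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 1.0]])
doc_vec = tokens.mean(axis=0)

# A token that disagrees with the document's dominant direction
# scores well below 1 even though nothing is wrong.
print(round(cosine(tokens[0], doc_vec), 4))  # 0.4472
```

So for phrase extraction it's usually more informative to compare tokens' scores against each other (ranking) than to read the absolute values.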