Best way to get contextual vector embeddings of tokens in a document #10361
-
This is what I'm trying to do, and I'd appreciate any thoughts on the best practice for achieving it. I have a collection of documents, and for every document I need to compute some metrics. For some of these metrics I need the vector representations/embeddings of the tokens (ideally word embeddings from Transformers, since I want contextual vector embeddings). I may also need to filter out tokens based on their part-of-speech (POS) tags. So I'm trying to figure out the best way to get both the POS tags and the vector representations of the tokens. Here is basically the input and output:

Input:

One point to consider: depending on which tokenizer is used, a word may be split into sub-tokens, each with its own embedding vector. Ideally, I don't want to deal with pooling the vectors; I want one single embedding vector for each POS-tagged token in the document. Currently, I've tried the following code:
However, there are two issues here: 1) as far as I understand. Also, I noticed that for each.

UPDATE: I just came across this useful discussion and realized that we have a
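Whichever library produces the wordpiece vectors, getting one vector per spaCy-style token comes down to an alignment plus a pooling step (mean-pooling is a common default); even when a library hides this, something like it happens under the hood. A minimal sketch of that step with made-up vectors, assuming you already have the wordpiece embeddings and a token-to-wordpiece alignment (e.g. from a Hugging Face fast tokenizer's `word_ids()`):

```python
import numpy as np

def pool_token_vectors(wordpiece_vectors, alignment):
    """Mean-pool wordpiece vectors into one vector per token.

    wordpiece_vectors: (n_wordpieces, hidden) array of subtoken embeddings.
    alignment: alignment[i] is the list of wordpiece indices that make up
    token i (e.g. derived from a fast tokenizer's word_ids()).
    """
    return np.vstack([wordpiece_vectors[idxs].mean(axis=0) for idxs in alignment])

# Toy data: 4 wordpieces with hidden size 3.
# Token 0 maps to wordpiece 0; token 1 was split into wordpieces 1-3.
wp = np.array([[1.0, 0.0, 0.0],
               [0.0, 2.0, 0.0],
               [0.0, 4.0, 0.0],
               [0.0, 0.0, 6.0]])
align = [[0], [1, 2, 3]]
tok_vecs = pool_token_vectors(wp, align)
print(tok_vecs.shape)  # (2, 3): one vector per token
print(tok_vecs[1])     # [0. 2. 2.]: mean of wordpieces 1-3
```

The same idea works with max-pooling or taking only the first subtoken's vector; which choice is best depends on the downstream metric.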
-
This is a good overview and includes a section on embeddings from trf models: https://applied-language-technology.mooc.fi/html/notebooks/part_iii/05_embeddings_continued.html
-
This looks good. Can it be added to spaCy's standard pipeline as one more component option? Also, I ran a few tests to check the similarity:
The similarity score is 0.78. I also wanted to see how similar each individual token is to the whole doc (for potential use in phrase extraction); those scores are extremely low:
Are these scores normal?
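Low token-to-doc similarities aren't necessarily a bug: spaCy's `similarity` is cosine similarity between vectors, and the contextual vector of a single token can point in a quite different direction from a vector averaged over the whole document. A minimal sketch with toy vectors (the numbers are made up purely for illustration):

```python
import numpy as np

def cosine(a, b):
    # spaCy's Doc.similarity / Token.similarity boils down to cosine similarity
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Three toy token vectors; the "doc vector" is their mean.
tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 1.0]])
doc_vec = tokens.mean(axis=0)

# A token that disagrees with the document's dominant direction
# scores well below 1 even though nothing is wrong.
print(round(cosine(tokens[0], doc_vec), 4))  # 0.4472
```

So for phrase extraction it's usually more informative to compare tokens' scores against each other (ranking) than to read the absolute values.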