Embedding Postprocessing? #10174

gzguevara · 2024-01-21T18:18:11Z


I am using HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2"). When I look at the raw numbers, they do not match what I see in default__vector_store.json.

Is this the expected behaviour? Does llama_index apply some post-processing to the embeddings?

Answered by dosubot[bot]


@gzguevara · 2024-01-21T18:19:03Z

dosubot[bot]
bot Jan 21, 2024

🤖

Hey there, @gzguevara! Good to see you back here. Don't worry, I'm diving into your issue and will get back to you with a detailed response soon. Hang tight!


@gzguevara · 2024-01-21T18:19:32Z

dosubot[bot]
bot Jan 21, 2024

🤖

Hey @gzguevara, hope you're doing well! Good to see you again. We're diving into another issue, let's get started.

Yes, the behavior you're observing is expected. LlamaIndex post-processes the embeddings produced by HuggingFaceEmbedding, so the raw model output does not match the values stored in default__vector_store.json. Two post-processing steps are applied:

Pooling: The pooling attribute of the HuggingFaceEmbedding class selects how the token embeddings are combined. If pooling is set to Pooling.CLS, the cls_pooling method of the Pooling enum is used. Otherwise, the _mean_pooling method is used, which takes the attention mask into account so that padding tokens are excluded from the average.
Normalization: If the normalize attribute of the HuggingFaceEmbedding class is set to True, the embeddings are normalized with torch.nn.functional.normalize using p=2 and dim=1, which scales each embedding to unit L2 length.
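To make the normalization step concrete, here is a minimal, dependency-free sketch of L2 normalization. It is not the LlamaIndex code: the helper name l2_normalize and the toy vector are assumptions for illustration, and the real implementation operates on torch tensors.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean (L2) length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vector))
    # Clamp the norm to a tiny positive value so a zero vector does not divide by zero.
    return [x / max(norm, 1e-12) for x in vector]

raw = [3.0, 4.0]          # toy "raw" embedding with L2 norm 5
unit = l2_normalize(raw)  # same direction, length 1
print(unit)               # [0.6, 0.8]
```

If the stored vectors have unit length while the raw ones do not, normalization is a likely cause of the mismatch. Applying the same scaling to a raw vector is one way to check.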

These post-processing steps are implemented in the _embed method of the HuggingFaceEmbedding class. You can find the relevant code in the llama_index/embeddings/huggingface.py file.

The pooling attribute in the HuggingFaceEmbedding class determines the strategy used to pool the embeddings generated by the HuggingFace model. Pooling collapses the model's per-token output into a single vector. This is needed because the model generates a sequence of embeddings (one for each token in the input), while you want one vector representation for the entire input.

If pooling is set to Pooling.CLS, the cls_pooling method of the Pooling enum is used to pool the embeddings. This method typically takes the embedding of the special [CLS] token as the representation of the entire input.

If pooling is not set to Pooling.CLS, the _mean_pooling method of the HuggingFaceEmbedding class is used. This method computes the mean of the token embeddings, taking the attention mask into account for correct averaging.
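The difference between the two pooling strategies can be illustrated with a small plain-Python sketch. The function names and toy data below are assumptions for illustration only and do not reproduce the library's tensor-based implementation.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])

def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
    # Divide by the number of real tokens; guard against an all-padding input.
    count = max(sum(attention_mask), 1)
    return [value / count for value in total]

# Three token vectors; the last one is padding (mask 0) and must not affect the mean.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

print(cls_pooling(tokens))         # [1.0, 2.0]
print(mean_pooling(tokens, mask))  # [2.0, 3.0]
```

Note that the padding vector [9.0, 9.0] has no influence on the mean-pooled result, which is the purpose of the attention mask.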

I hope this clarifies your question. If you have any further questions, feel free to ask.

Best,
Dosu

Sources

llama_index/embeddings/huggingface.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

