How would I best do pure cosine-similarity retrieval with precomputed image and sentence vectors (say, stored in a vector database)? The example notebooks don't really show that.

In my current case I am able to retrieve similar images given other image embeddings, but text => image retrieval does not work at all. (Is it even supposed to work before applying the logit scale and bias?)

Looking at the Transformers implementation, this is the part I follow to extract the vectors:
```python
# normalized features
image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)

# cosine similarity as logits
logits_per_text = torch.matmul(text_embeds, image_embeds.t().to(text_embeds.device))
logit_scale, logit_bias = self.logit_scale.to(text_embeds.device), self.logit_bias.to(text_embeds.device)
logits_per_text = logits_per_text * logit_scale.exp() + logit_bias
```
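For reference, this is roughly how I obtain the vectors I store, as a minimal sketch. The checkpoint name and image path are just illustrative placeholders for my setup:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

ckpt = "google/siglip-base-patch16-224"  # assumed checkpoint; substitute your own
model = SiglipModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")  # illustrative path
# The HF SigLIP docs recommend padding="max_length" for text inputs,
# since that is how the model was trained.
text_inputs = processor(text=["a photo of a cat"], padding="max_length", return_tensors="pt")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)  # (1, dim)
    text_embeds = model.get_text_features(**text_inputs)     # (1, dim)

# L2-normalize before writing to the vector database
image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
```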
At the moment I save just the `image_embeds` and `text_embeds` and run retrieval over them, but the scores are quite low, especially compared to a CLIP baseline.
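As far as I can tell, ranking by the raw dot product of the normalized vectors should match ranking by the logits, since `logit_scale.exp()` is positive and `logit_bias` is a constant (a monotonic affine map). For completeness, a minimal sketch of the retrieval I do (in-memory here; the vector database performs the equivalent top-k):

```python
import torch

def topk_images(text_embed: torch.Tensor, image_embeds: torch.Tensor, k: int = 5):
    """Rank images by cosine similarity to a single text embedding.

    Both inputs are assumed L2-normalized, so the dot product equals
    cosine similarity: image_embeds is (num_images, dim), text_embed is (dim,).
    """
    sims = image_embeds @ text_embed  # (num_images,) cosine similarities
    scores, indices = sims.topk(min(k, sims.numel()))
    return scores, indices

# e.g. scores, idx = topk_images(text_embeds[0], image_embeds, k=10)
```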