Efficient Document Similarity Search using DocBin and Annoy/Faiss #8865
Replies: 1 comment
-
I don't think DocBin is a good fit for this use case. The format is designed to store documents on disk compactly, but the tradeoff is that you can't pull individual documents out of it: to get one document, you have to deserialize the whole DocBin. The docs mention this briefly. For search-style applications you generally want a database with document IDs, raw text, and vector representations. Normally you wouldn't need the spaCy Doc object itself, though you might depending on what you want to do. For trying this out, a SQLite DB would be fine. So for your questions:
-
I am trying to find the N documents in a corpus that are most similar to a given document outside the corpus.
I am looking for the most efficient and accurate way to do this.
Before the search I compute all the embeddings. I would like this step to be as efficient as possible, so I would like to use the DocBin class from spaCy. Each doc will also have an ID as an attribute.
My main questions:
How can I select a subset of a previously computed DocBin based on ID? Let's say I just want the docs whose IDs are in a given list.
How can I pass a DocBin with around 2000 docs to Annoy, for instance, and get back the IDs of the N most similar docs?
Finally, what if I want to update my main corpus, not only by adding content but also by substituting some? Let's say I want to add the Doc associated with ID = K: how can I change the DocBin?
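A minimal sketch of the database approach suggested in the reply, applied to the three questions above. It is only an illustration under some assumptions: the table layout, the helper names (`upsert`, `get_subset`, `most_similar`) and the toy 3-dimensional vectors are all hypothetical, and in practice the vectors would come from something like spaCy's `doc.vector`. It uses a plain linear scan with cosine similarity in place of Annoy or Faiss.

```python
import math
import sqlite3
from array import array


def to_blob(vec):
    # Pack a list of floats into bytes (float32) for storage in a BLOB column.
    return array("f", vec).tobytes()


def from_blob(blob):
    # Inverse of to_blob: bytes back to a list of floats.
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def upsert(conn, doc_id, text, vec):
    # Question 3: add a new document, or substitute the one stored under doc_id.
    conn.execute(
        "INSERT OR REPLACE INTO docs (id, text, vector) VALUES (?, ?, ?)",
        (doc_id, text, to_blob(vec)),
    )


def get_subset(conn, ids):
    # Question 1: fetch only the documents whose IDs are in the given list.
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, text, vector FROM docs WHERE id IN ({marks})", list(ids)
    )
    return {doc_id: (text, from_blob(blob)) for doc_id, text, blob in rows}


def most_similar(conn, query_vec, n):
    # Question 2: IDs of the n stored documents closest to query_vec.
    rows = conn.execute("SELECT id, vector FROM docs")
    scored = [(cosine(query_vec, from_blob(blob)), doc_id) for doc_id, blob in rows]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, vector BLOB)")

# Toy vectors standing in for real embeddings (assumption for illustration).
upsert(conn, 1, "first document", [1.0, 0.0, 0.0])
upsert(conn, 2, "second document", [0.0, 1.0, 0.0])
upsert(conn, 3, "third document", [0.9, 0.1, 0.0])

subset = get_subset(conn, [1, 3])
before = most_similar(conn, [1.0, 0.0, 0.0], n=2)

# Substitute the document stored under ID 2, then search again.
upsert(conn, 2, "second document, revised", [0.95, 0.05, 0.0])
after = most_similar(conn, [1.0, 0.0, 0.0], n=2)
```

With a corpus of around 2000 docs, a linear scan like this is usually fast enough. If the corpus grows, the scan inside `most_similar` is the part an approximate index such as Annoy or Faiss would replace, while the database keeps handling lookup and updates by ID.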