Efficient Document Similarity Search using DocBin and Annoy/Faiss #8865

filipematos95 · 2021-08-02T17:20:35Z

I need to find the N documents in a corpus that are most similar to a given document from outside the corpus.

I am trying to figure out the most efficient and accurate way to deal with the problem.

Before searching, I compute all the embeddings. I would like this step to be as efficient as possible, so I would like to use spaCy's DocBin class. Each doc will also have an ID as an attribute.

My main questions:

1. How can I select a subset of a previously computed DocBin based on ID? Let's say I just want the docs whose IDs are in a given list.
2. How can I pass a DocBin with around 2000 docs to Annoy, for instance, and get back the IDs of the N most similar docs?
3. Finally, what if I want to update my main corpus, not only by adding content but also by replacing some of it? Let's say I want to add the Doc associated with ID = K: how can I change the DocBin?

polm · 2021-08-03T06:25:50Z

I don't think the DocBin is a good fit for this use case. The format is designed to be able to store documents on disc compactly, but one tradeoff it makes to achieve that is that you can't pull individual documents out of it - if you want to get one document, you have to deserialize the whole DocBin. The docs mention this a bit.

For search-style applications you generally want a database with document IDs, raw text, and vector representations. Normally you wouldn't need the spaCy Doc object itself, though you might, depending on what you want to do. A SQLite DB would be fine for trying this out.
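A minimal sketch of such a store, using only Python's standard library. The table layout and the helper names (`open_store`, `upsert_doc`, `fetch_vectors`) are assumptions made for illustration, not part of spaCy or of the advice above. With spaCy, the vector would typically come from `doc.vector`; the numbers here are made up.

```python
import sqlite3
from array import array


def open_store(path=":memory:"):
    """Open (or create) a small document store keyed by document ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "doc_id TEXT PRIMARY KEY, text TEXT NOT NULL, vector BLOB NOT NULL)"
    )
    return conn


def upsert_doc(conn, doc_id, text, vector):
    """Insert a document, or replace the row that already has this ID."""
    blob = array("f", vector).tobytes()  # float32 values packed as bytes
    conn.execute(
        "INSERT OR REPLACE INTO docs (doc_id, text, vector) VALUES (?, ?, ?)",
        (doc_id, text, blob),
    )
    conn.commit()


def fetch_vectors(conn, doc_ids):
    """Return {doc_id: list of floats} for only the requested IDs."""
    doc_ids = list(doc_ids)
    if not doc_ids:
        return {}
    placeholders = ",".join("?" for _ in doc_ids)
    rows = conn.execute(
        f"SELECT doc_id, vector FROM docs WHERE doc_id IN ({placeholders})",
        doc_ids,
    )
    result = {}
    for doc_id, blob in rows:
        vec = array("f")
        vec.frombytes(blob)  # unpack the stored float32 bytes
        result[doc_id] = vec.tolist()
    return result


conn = open_store()
upsert_doc(conn, "a", "first document", [1.0, 0.0, 0.5])
upsert_doc(conn, "b", "second document", [0.0, 1.0, 0.25])
# Same ID again: the existing row for "a" is replaced, not duplicated.
upsert_doc(conn, "a", "first document, revised", [0.5, 0.5, 0.0])
subset = fetch_vectors(conn, ["a"])
```

Keying rows by ID makes both selecting a subset and replacing a single document a one-statement operation, which is the part a DocBin does not offer directly.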

So for your questions:

1. There is no good way to do this - you'd have to iterate over every doc in the DocBin and return the doc when you find it.
2. You can't pass a DocBin to Annoy. You need to get a vector representation of a doc, pass that to Annoy, and use the IDs Annoy returns to get matching docs.
3. Once you have loaded a serialized DocBin you should be able to just call add on it as usual.
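The vector-then-lookup flow from the second answer can be sketched as follows. This is not Annoy: it is an exact brute-force cosine search in plain Python, swapped in so the example runs without extra dependencies, and for a corpus of around 2000 docs it may well be fast enough. The function names and the toy vectors are assumptions for illustration. With Annoy the shape would be similar, except that Annoy items are indexed by integers, so you would keep your own mapping from item index to document ID.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_vector, corpus_vectors, n):
    """Return the IDs of the n corpus vectors most similar to the query."""
    scored = sorted(
        corpus_vectors.items(),
        key=lambda item: cosine(query_vector, item[1]),
        reverse=True,  # highest similarity first
    )
    return [doc_id for doc_id, _ in scored[:n]]


# Toy vectors standing in for document embeddings, keyed by document ID.
corpus_vectors = {
    "doc-1": [1.0, 0.0, 0.0],
    "doc-2": [0.9, 0.1, 0.0],
    "doc-3": [0.0, 1.0, 0.0],
}
# The query vector belongs to a document outside the corpus.
top_ids = most_similar([1.0, 0.05, 0.0], corpus_vectors, n=2)
```

The returned IDs are then used to look up the raw text (or anything else) in whatever store holds the documents.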
