Duplicate documents based on content and meta #3365
Hey Team,
Hi @JaisVJ , when you set `id_hash_keys=['content','meta']`, Haystack will create a hash of the content and everything that is stored in `meta`, which I believe works for your use case, no? Does selecting a subset of the data stored in `meta` make a difference for you?

If two documents have the exact same `content` and the exact same `meta`, then they are duplicates. If they differ in some of the values stored in `meta`, then they are not duplicates. Or, in your use case, could it be that two documents differ in some of their metadata values but should still be treated as duplicates?
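To make this concrete, here is a minimal sketch against the Haystack 1.x `Document` API (the import path may vary across versions): two documents with identical content but different metadata receive different ids.

```python
from haystack.schema import Document

# Identical content, differing meta: hashing both fields yields different ids.
doc_a = Document(content="same text", meta={"source": "a.pdf"},
                 id_hash_keys=["content", "meta"])
doc_b = Document(content="same text", meta={"source": "b.pdf"},
                 id_hash_keys=["content", "meta"])

print(doc_a.id == doc_b.id)  # False -> not treated as duplicates
```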
Right now, `id_hash_keys` can only be a subset of `[content, content_type, id, score, meta, embedding]` (see Line 125 in e2e6887).
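Conceptually, the id is derived from the stringified values of the selected top-level fields, roughly along these lines (an illustrative sketch only, not the actual Haystack code; the real logic is at the line referenced above):

```python
import hashlib

# Illustrative sketch: hash the stringified values of the chosen top-level
# fields. Since 'meta' is stringified as a whole dict, a single meta key
# cannot be hashed in isolation.
def make_id(doc_fields: dict, id_hash_keys: list) -> str:
    final_hash_key = ":".join(str(doc_fields[key]) for key in id_hash_keys)
    return hashlib.md5(final_hash_key.encode("utf-8")).hexdigest()

print(make_id({"content": "same text", "meta": {"source": "a.pdf"}},
              ["content", "meta"]))
```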
Therefore, it is unfortunately not possible to provide a particular field from `meta` to `id_hash_keys`, and I am not aware of any other way to classify records as duplicates.

Here is the part of the code where we drop duplicates based on their id, if you are interested: haystack/haystack/document_stores/base.py, Line 616 in b10e2c3 (a simplified sketch follows below).

As a side note: there is an open issue on `id_hash_keys` not working as expected in some cases: #3236
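The deduplication referenced above essentially keeps the first document seen for each id; here is a simplified sketch of that logic (Haystack 1.x import assumed; this is not a verbatim copy of base.py):

```python
from typing import List
from haystack.schema import Document

def drop_duplicate_documents(documents: List[Document]) -> List[Document]:
    # Keep the first occurrence of each id; later documents with the same
    # id (i.e. the same hashed fields) are dropped as duplicates.
    seen_ids = set()
    unique_documents = []
    for doc in documents:
        if doc.id in seen_ids:
            continue
        seen_ids.add(doc.id)
        unique_documents.append(doc)
    return unique_documents
```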