Duplicate documents based on content and meta #3365
Hey Team,
Hi @JaisVJ , when you set `id_hash_keys=['content','meta']`, Haystack will create a hash of the content and everything that is stored in `meta`, which I believe works for your use case, no? Does selecting a subset of the data stored in `meta` make a difference for you?

If two documents have the exact same `content` and the exact same `meta`, then they are duplicates. If they differ in some of the values stored in `meta`, then they are not duplicates. Or, in your use case, could it be that two documents differ in some of their metadata values but should still be treated as duplicates?
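To make this concrete, here is a minimal sketch against the Haystack 1.x `Document` API (the import path may vary across versions): two documents with identical content but different metadata receive different ids.

```python
from haystack.schema import Document

# Identical content, differing meta: hashing both fields yields different ids.
doc_a = Document(content="same text", meta={"source": "a.pdf"},
                 id_hash_keys=["content", "meta"])
doc_b = Document(content="same text", meta={"source": "b.pdf"},
                 id_hash_keys=["content", "meta"])

print(doc_a.id == doc_b.id)  # False -> not treated as duplicates
```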
Right now, `id_hash_keys` can only be a subset of `[content, content_type, id, score, meta, embedding]` (see Line 125 in e2e6887).
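Conceptually, the id is derived from the stringified values of the selected top-level fields, roughly along these lines (an illustrative sketch only, not the actual Haystack code; the real logic is at the line referenced above):

```python
import hashlib

# Illustrative sketch: hash the stringified values of the chosen top-level
# fields. Since 'meta' is stringified as a whole dict, a single meta key
# cannot be hashed in isolation.
def make_id(doc_fields: dict, id_hash_keys: list) -> str:
    final_hash_key = ":".join(str(doc_fields[key]) for key in id_hash_keys)
    return hashlib.md5(final_hash_key.encode("utf-8")).hexdigest()

print(make_id({"content": "same text", "meta": {"source": "a.pdf"}},
              ["content", "meta"]))
```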
Therefore, it is unfortunately not possible to provide a particular field from `meta` to `id_hash_keys`, and I am not aware of any other way to classify records as duplicates.

Here is the part of the code where we drop duplicates based on their id, if you are interested: haystack/haystack/document_stores/base.py, Line 616 in b10e2c3 (a simplified sketch follows below).

As a side note: there is an open issue on `id_hash_keys` not working as expected in some cases: #3236
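The deduplication referenced above essentially keeps the first document seen for each id; here is a simplified sketch of that logic (Haystack 1.x import assumed; this is not a verbatim copy of base.py):

```python
from typing import List
from haystack.schema import Document

def drop_duplicate_documents(documents: List[Document]) -> List[Document]:
    # Keep the first occurrence of each id; later documents with the same
    # id (i.e. the same hashed fields) are dropped as duplicates.
    seen_ids = set()
    unique_documents = []
    for doc in documents:
        if doc.id in seen_ids:
            continue
        seen_ids.add(doc.id)
        unique_documents.append(doc)
    return unique_documents
```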