I found a similar solved discussion on how to check if a document exists before inserting it into the index. You can use the _handle_upserts method. Here is the relevant code snippet:

    def _handle_upserts(
        self,
        nodes: Sequence[BaseNode],
        store_doc_text: bool = True,
    ) -> Sequence[BaseNode]:
        """Handle docstore upserts by checking hashes and ids."""
        assert self.docstore is not None

        doc_ids_from_nodes = set()
        deduped_nodes_to_run = {}
        for node in nodes:
            ref_doc_id = node.ref_doc_id if node.ref_doc_id else node.id_
            doc_ids_from_nodes.add(ref_doc_id)
            existing_hash = self.docstore.get_document_hash(ref_doc_id)
            if not existing_hash:
                # document doesn't exist, so add it
                deduped_nodes_to_run[ref_doc_id] = node
            elif existing_hash and existing_hash != node.hash:
                # document exists but its content changed, so replace it
                self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)

                if self.vector_store is not None:
                    self.vector_store.delete(ref_doc_id)

                deduped_nodes_to_run[ref_doc_id] = node
            else:
                continue  # document exists and is unchanged, so skip it

        nodes_to_run = list(deduped_nodes_to_run.values())
        self.docstore.add_documents(nodes_to_run, store_text=store_doc_text)

        return nodes_to_run

This method checks the document store for existing document IDs and hashes before any documents are added or updated [1][2].
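To see the hash comparison in isolation, here is a minimal, self-contained sketch of the same idea. It does not use LlamaIndex: ToyNode, ToyDocstore, and dedupe_for_upsert are hypothetical stand-ins invented for illustration, and the toy records the hash itself rather than relying on a real docstore to do so. It only shows how comparing a stored hash against a node's current hash separates new, changed, and unchanged documents.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: an id plus its text."""

    id_: str
    text: str

    @property
    def hash(self) -> str:
        # Content hash; any change to the text yields a different value.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


@dataclass
class ToyDocstore:
    """Hypothetical in-memory store that only tracks doc_id -> content hash."""

    hashes: Dict[str, str] = field(default_factory=dict)

    def get_document_hash(self, doc_id: str) -> Optional[str]:
        return self.hashes.get(doc_id)

    def set_document_hash(self, doc_id: str, doc_hash: str) -> None:
        self.hashes[doc_id] = doc_hash


def dedupe_for_upsert(nodes: List[ToyNode], docstore: ToyDocstore) -> List[ToyNode]:
    """Return only the nodes that are new or whose content has changed."""
    to_run: Dict[str, ToyNode] = {}
    for node in nodes:
        existing_hash = docstore.get_document_hash(node.id_)
        if existing_hash == node.hash:
            continue  # already stored and unchanged, so skip it
        # Unknown id, or known id with different content: record and keep it.
        docstore.set_document_hash(node.id_, node.hash)
        to_run[node.id_] = node
    return list(to_run.values())


store = ToyDocstore()
first = dedupe_for_upsert([ToyNode("doc-1", "hello"), ToyNode("doc-2", "world")], store)
second = dedupe_for_upsert([ToyNode("doc-1", "hello"), ToyNode("doc-2", "world!")], store)
print([n.id_ for n in first])   # ['doc-1', 'doc-2']
print([n.id_ for n in second])  # ['doc-2']
```

On the second run doc-1 is skipped because its hash matches the stored one, while doc-2 is returned again because its text changed. A check that only asks whether the ID is present would miss that update, which is presumably why the quoted method compares hashes as well as IDs.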
How could we check if a doc_id already exists before inserting via IngestionPipeline?