Avoiding duplicated embeddings in vector stores (Pinecone, etc...) #266
-
Hi! First of all: great project. As far as I understand, each time I run a flow that uses a vector store, it creates and stores additional embeddings and doesn't check for duplicates. Any idea or concept for how to solve this? In my previous projects (using langflow) I also used separate functions for aggregation -> index / embeddings -> retrieve / chat. I also found this thread at langchain-ai/langchain#2699 regarding "updating existing documents".
Replies: 3 comments
-
Currently the best way is to separate it into 2 flows:
1.) Upsert flow (create index)
2.) Load existing index

However, for the upsert flow, documents only get upserted once, when the user starts asking questions (only the first time); it won't upsert again whenever the user asks another question. ONLY when the flow configuration changes (e.g. a different file, a different OpenAI key, a different Pinecone index) and you save the flow again will another upsert be done.
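For readers who want to see what the two flows boil down to underneath the Flowise UI, here is a minimal sketch using the Python LangChain API (Flowise actually wraps LangChain.js, so this only illustrates the idea, not Flowise's implementation). The index name, environment variables, file path, and chunk sizes are placeholder assumptions:

```python
# Sketch only: the "two flow" split expressed with the Python LangChain API.
import os
import pinecone
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment=os.environ["PINECONE_ENV"])
embeddings = OpenAIEmbeddings()

def upsert_flow(path: str, index_name: str) -> Pinecone:
    """Flow 1: run once to create/populate the index (the expensive step)."""
    docs = TextLoader(path).load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
    return Pinecone.from_documents(chunks, embeddings, index_name=index_name)

def load_flow(index_name: str) -> Pinecone:
    """Flow 2: reuse the index that already exists; no re-embedding."""
    return Pinecone.from_existing_index(index_name, embeddings)

# upsert_flow("docs.txt", "my-index")   # run once
# store = load_flow("my-index")         # every chat session
# print(store.similarity_search("some question")[0].page_content)
```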
-
Thanks for your answer. Still, the issue of "updating" an existing index remains; it seems to be rooted in the langchain library itself. The first issue (with Pinecone) is that it doesn't separate documents from embeddings (text chunks), which seems to be a problem for many implementations.
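One workaround discussed in the LangChain thread linked above is to pass your own deterministic IDs when upserting, so re-running the same documents overwrites the existing vectors instead of adding new ones. A rough sketch of that idea in Python LangChain follows; the hashing scheme, index name, and environment variables are assumptions, not something Flowise does today:

```python
# Sketch: deterministic IDs derived from the chunk text, so re-upserting the
# same content overwrites the same Pinecone vectors instead of duplicating them.
import hashlib
import os
import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment=os.environ["PINECONE_ENV"])

def upsert_deduplicated(chunks: list[str], index_name: str) -> Pinecone:
    # Same chunk text -> same ID, so a second run updates rather than duplicates.
    ids = [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in chunks]
    return Pinecone.from_texts(chunks, OpenAIEmbeddings(), ids=ids, index_name=index_name)
```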
-
I'm seeing a similar issue with the Faiss index. It seems like the upsert should check the timestamp on the input file and the timestamp on the embeddings file, and only run the embeddings if the file is newer than the index. Is this something Flowise can handle, or is it beneath Flowise, at the Langchain layer? One nice thing about the solution Henry proposed is the ability to generate the index file outside of Flowise. I have one customer example with 100K products; it takes hours to run the embeddings, and sometimes you get timeouts and need to restart. So creating the Faiss index file or Pinecone index outside of Flowise is a nice capability.
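To make the timestamp idea concrete, here is a rough sketch of what that check could look like when the FAISS index is built outside of Flowise with the Python LangChain API. The paths, chunk sizes, and helper name are assumptions for illustration; this is not behaviour Flowise provides itself:

```python
# Sketch: only re-run embeddings when the source file is newer than the saved
# FAISS index; otherwise load the existing index from disk.
import os
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

def load_or_build(source_path: str, index_dir: str) -> FAISS:
    embeddings = OpenAIEmbeddings()
    index_file = os.path.join(index_dir, "index.faiss")
    if os.path.exists(index_file) and os.path.getmtime(index_file) >= os.path.getmtime(source_path):
        # Index is up to date: skip the slow, expensive embedding step.
        return FAISS.load_local(index_dir, embeddings)
    docs = TextLoader(source_path).load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
    store = FAISS.from_documents(chunks, embeddings)
    store.save_local(index_dir)  # the saved folder can then be pointed at from Flowise
    return store
```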