Using Clustering to retrieve similar documents #23950
Kirushikesh
announced in
Ideas
Checked
Feature request
Implement a clustering-based retrieval method for RAG pipelines. This feature would group document embeddings with a clustering algorithm such as DBSCAN, so that retrieval starts from cluster proximity rather than from a direct vector similarity comparison against every document.
Motivation
Current RAG implementations in LangChain rely on vector similarity search, which can become slow on large datasets when each query has to be compared against the whole collection. A clustering-based approach could reduce the number of comparisons per query by searching only the nearest cluster(s), and it would also group documents thematically. This would be particularly useful for large-scale document retrieval and for topic-based information retrieval.
Proposal (If applicable)
Introduce a new retriever class that:
a) Clusters document embeddings using DBSCAN or similar algorithms during indexing.
b) For queries, finds the nearest cluster(s) to the query embedding.
c) Retrieves top-k documents from the identified cluster(s).
d) Optionally implements a hybrid approach, using clustering for initial filtering and vector similarity for final ranking within the selected cluster(s).
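To make steps a) to d) concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `ClusterRetriever`, `dbscan`, `cosine_distance`, the `eps` and `min_samples` values, and the toy 2-D embeddings are hypothetical names and data, not an existing LangChain API. The DBSCAN here is a small O(n²) version written out so the example is self-contained; a real implementation would more likely use a library such as scikit-learn and subclass LangChain's retriever base class.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(points, eps, min_samples):
    """Toy O(n^2) DBSCAN. Returns one label per point, -1 for noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


class ClusterRetriever:
    """Hypothetical retriever: cluster at index time, route queries by centroid."""

    def __init__(self, docs, embeddings, eps=0.05, min_samples=2):
        self.docs = docs
        self.embeddings = embeddings
        self.labels = dbscan(embeddings, eps, min_samples)  # step (a)
        self.members = defaultdict(list)
        for idx, label in enumerate(self.labels):
            self.members[label].append(idx)
        dim = len(embeddings[0])
        # One centroid per cluster; noise points (label -1) get no centroid.
        self.centroids = {
            label: [sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)]
            for label, idxs in self.members.items()
            if label != -1
        }

    def retrieve(self, query_embedding, k=2, n_clusters=1):
        # Step (b): pick the cluster(s) whose centroid is closest to the query.
        nearest = sorted(
            self.centroids,
            key=lambda c: cosine_distance(query_embedding, self.centroids[c]),
        )[:n_clusters]
        # Steps (c) and (d): rank only the members of those clusters by similarity.
        candidates = [i for c in nearest for i in self.members[c]]
        candidates.sort(key=lambda i: cosine_distance(query_embedding, self.embeddings[i]))
        return [self.docs[i] for i in candidates[:k]]


# Toy data: two tight topics plus one outlier (made-up 2-D "embeddings").
docs = ["cat care", "dog training", "stock markets", "bond yields", "outlier"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [-1.0, -1.0]]

retriever = ClusterRetriever(docs, embeddings)
print(retriever.retrieve([1.0, 0.0], k=2))  # ['cat care', 'dog training']
```

Two design questions show up even in this sketch. DBSCAN labels some points as noise, and here those documents are never returned, so a real retriever would need a fallback for them (for example a plain similarity search over the noise bucket). Also, DBSCAN clusters need not be compact, so routing by centroid is only a heuristic; the `n_clusters` parameter lets a query search more than one cluster to soften that.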