Using Clustering to retrieve similar documents #23950

Kirushikesh · 2024-07-07T15:33:30Z

Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

Implement a clustering-based retrieval method for RAG pipelines. This feature would use clustering algorithms like DBSCAN to group document embeddings, allowing for retrieval based on cluster proximity rather than direct vector similarity.

Motivation

Current RAG implementations in LangChain rely on vector similarity search, which may become inefficient for large datasets. A clustering-based approach could potentially improve retrieval speed and provide thematic grouping of documents. This feature would be particularly useful for applications dealing with large-scale document retrieval or topic-based information retrieval.

Proposal (If applicable)

Introduce a new retriever class that:
a) Clusters document embeddings using DBSCAN or similar algorithms during indexing.
b) For queries, finds the nearest cluster(s) to the query embedding.
c) Retrieves top-k documents from the identified cluster(s).
d) Optionally implements a hybrid approach, using clustering for initial filtering and vector similarity for final ranking within the selected clusters.
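
To make steps a) through d) concrete, here is a minimal sketch of what such a retriever might look like. Everything in it is an assumption for illustration: `ClusterRetriever`, `dbscan`, and `cosine_distance` are hypothetical names, not existing LangChain APIs, and the toy 2-D vectors stand in for real embeddings. The small pure-Python DBSCAN is only there to keep the example dependency-free; a real implementation would more likely call `sklearn.cluster.DBSCAN`.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(points, eps, min_samples):
    """Minimal O(n^2) DBSCAN. Returns one label per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j in range(len(points))
                if cosine_distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


class ClusterRetriever:
    """Hypothetical retriever: cluster at index time, route queries by centroid."""

    def __init__(self, docs, embeddings, eps=0.05, min_samples=2):
        self.docs = docs
        self.embeddings = embeddings
        # a) cluster the document embeddings during indexing
        self.members = defaultdict(list)
        for idx, label in enumerate(dbscan(embeddings, eps, min_samples)):
            self.members[label].append(idx)
        # One centroid per real cluster; noise (-1) gets none.
        self.centroids = {
            label: [sum(dim) / len(idxs)
                    for dim in zip(*(embeddings[i] for i in idxs))]
            for label, idxs in self.members.items() if label != -1
        }

    def retrieve(self, query_embedding, k=2, n_clusters=1):
        # b) find the nearest cluster(s) to the query embedding
        nearest = sorted(
            self.centroids,
            key=lambda c: cosine_distance(query_embedding, self.centroids[c]),
        )[:n_clusters]
        # c) + d) gather candidates, then rank them by vector similarity
        candidates = [i for c in nearest for i in self.members[c]]
        candidates.sort(
            key=lambda i: cosine_distance(query_embedding, self.embeddings[i]))
        return [self.docs[i] for i in candidates[:k]]


docs = ["cats", "kittens", "stocks", "bonds", "outlier"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [-1.0, -1.0]]
retriever = ClusterRetriever(docs, embeddings)
print(retriever.retrieve([0.05, 1.0], k=2))  # ['stocks', 'bonds']
```

One open design question this sketch exposes: DBSCAN labels some points as noise, and here those documents are never returned. A real implementation would need a policy for them, for example falling back to plain vector similarity search over the noise points.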

Replies: 0 comments
