Please include hybrid search in LangChain's MultiVectorRetriever for accurate multimodal RAG #30698

mahendra867 · 2025-04-06T20:54:08Z

Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

🔧 Feature Request
Title: Support Hybrid Search in MultiVectorRetriever

Description:
Please add native support for hybrid search in MultiVectorRetriever by enabling compatibility with retrievers such as PineconeHybridSearchRetriever. Currently, hybrid search retrievers do not subclass LangChain's VectorStore, which results in incompatibility errors when attempting to pass them to MultiVectorRetriever.

Relevant Links:

LangChain MultiVectorRetriever Docs

Pinecone Hybrid Search Retriever

LangChain VectorStore Interface

Motivation

💡 Motivation
In real-world multimodal Retrieval-Augmented Generation (RAG) systems, we often summarize each modality (text, tables, and images) and embed those summaries as vectors. Retrieval over these summaries benefits significantly from hybrid search, which combines dense embeddings with sparse retrieval such as BM25. However, the current MultiVectorRetriever in LangChain only accepts vector stores that subclass VectorStore, which excludes hybrid search retrievers like PineconeHybridSearchRetriever.

This makes it difficult to build hybrid multimodal RAG systems without custom patches or wrappers.

I'm always frustrated when I try to pass a hybrid search retriever to MultiVectorRetriever and encounter compatibility errors, despite it having all the required methods.
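For context, here is a minimal sketch of the weighting idea behind hybrid search: the dense vector is scaled by alpha and the sparse (BM25-style) values by 1 - alpha before querying. The function name `hybrid_scale` and the dict layout are my own assumptions for illustration, not a LangChain or Pinecone API.

```python
def hybrid_scale(dense, sparse, alpha):
    """Weight a dense vector by alpha and sparse values by (1 - alpha).

    Hypothetical helper: alpha=1.0 means dense-only, alpha=0.0 means sparse-only.
    `sparse` is assumed to be a dict with "indices" and "values" lists.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [value * (1 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse
```

With alpha around 0.5, both the semantic match and the keyword match contribute to the ranking, which is the behavior I would like to get for summary retrieval.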

Proposal (If applicable)

import os
import uuid

from dotenv import load_dotenv
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.retrievers import PineconeHybridSearchRetriever
from langchain_core.documents import Document
from langchain_openai import AzureOpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec
from pinecone_text.sparse import BM25Encoder

load_dotenv()

# Assumption: the Azure settings live in environment variables with these names.
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")


def create_retriever(text, text_summary, table, table_summary, image, image_summary):
    embeddings = AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-ada-002",  # Your deployment name in Azure
        model="text-embedding-ada-002",  # Optional but recommended
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        openai_api_version="",  # Fill in your Azure OpenAI API version
        api_key=AZURE_OPENAI_API_KEY,
    )

    # Load the Pinecone API key from the environment
    pinecone_api = os.getenv("pinecone_api")
    pc = Pinecone(api_key=pinecone_api)

    index_name = "langchain-pinecone-hybrid-search"

    # Create the index if it does not exist yet
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=1536,  # dimensionality of dense model
            metric="dotproduct",  # sparse values supported only for dotproduct
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )

    index = pc.Index(index_name)

    # Fit BM25 on the summaries. fit() learns its own parameters from the
    # corpus, so the pretrained default() encoder is not needed here.
    summaries_corpus = text_summary + table_summary + image_summary
    bm25_encoder = BM25Encoder()
    bm25_encoder.fit(summaries_corpus)

    # Optional: save and reload BM25 (if you want persistence)
    bm25_encoder.dump("bm25_values.json")
    bm25_encoder = BM25Encoder().load("bm25_values.json")

    # Hybrid retriever passed where a vector store is expected
    vectorstore = PineconeHybridSearchRetriever(
        embeddings=embeddings, sparse_encoder=bm25_encoder, index=index
    )

    # Chroma alternative (commented out):
    # vectorstore = Chroma(collection_name="multi_modal_rag_neuyysw", embedding_function=embeddings, persist_directory="./chroma_db")
    store = InMemoryStore()
    id_key = "doc_id"

    # Create multi-vector retriever; passing the hybrid retriever here is what fails
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the retriever
    def add_documents_to_retriever(documents, summaries, retriever):
        if summaries:
            doc_ids = [str(uuid.uuid4()) for _ in documents]
            # Each summary carries the ID of the raw document it describes
            summary_docs = [
                Document(page_content=summary, metadata={id_key: doc_ids[i]})
                for i, summary in enumerate(summaries)
            ]
            retriever.vectorstore.add_documents(summary_docs)
            retriever.docstore.mset(list(zip(doc_ids, documents)))

    # Add text, table, and image summaries to the retriever
    add_documents_to_retriever(text, text_summary, retriever)
    add_documents_to_retriever(table, table_summary, retriever)
    add_documents_to_retriever(image, image_summary, retriever)

    return retriever

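This fails because PineconeHybridSearchRetriever is not a subclass of VectorStore, so MultiVectorRetriever rejects it as the vectorstore argument. Functionally it covers the same ground, since it can index texts and run searches. As far as I can tell, though, it exposes these through add_texts and the standard retriever interface rather than through add_documents and similarity_search, so some translation between the two interfaces is still needed.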

✅ Proposed Solution
Allow MultiVectorRetriever to optionally accept any retriever-like object that implements the add_documents and similarity_search interface.
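A minimal sketch of what such a duck-typed check could look like, using a runtime-checkable `typing.Protocol`. The names `SupportsVectorSearch`, `FakeHybridRetriever`, and `accepts_backend` are hypothetical and exist only for this illustration; this is not how MultiVectorRetriever validates its arguments today.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class SupportsVectorSearch(Protocol):
    """Hypothetical structural type: anything with these two methods qualifies."""

    def add_documents(self, documents: List[Any], **kwargs: Any) -> List[str]: ...

    def similarity_search(self, query: str, k: int = 4, **kwargs: Any) -> List[Any]: ...


class FakeHybridRetriever:
    """Stand-in backend that does NOT inherit from any vector store base class."""

    def __init__(self) -> None:
        self._docs: List[Any] = []

    def add_documents(self, documents: List[Any], **kwargs: Any) -> List[str]:
        start = len(self._docs)
        self._docs.extend(documents)
        return [str(start + i) for i in range(len(documents))]

    def similarity_search(self, query: str, k: int = 4, **kwargs: Any) -> List[Any]:
        # Naive substring match, only to make the sketch executable.
        matches = [d for d in self._docs if query.lower() in str(d).lower()]
        return matches[:k]


def accepts_backend(obj: Any) -> bool:
    """Return True if obj has the methods the protocol names (presence only)."""
    return isinstance(obj, SupportsVectorSearch)
```

Note that a runtime protocol check only confirms the methods exist, not that their signatures or return types match, so some extra validation would probably still be needed in a real implementation.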

Alternatively, create a wrapper or adapter that conforms hybrid retrievers to the VectorStore interface.
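To make the adapter idea concrete, here is a rough sketch using plain-Python stand-ins so it runs without LangChain or Pinecone installed. `Doc`, `FakeHybridRetriever`, and `HybridVectorStoreAdapter` are all hypothetical names; the stand-in retriever assumes a simplified `add_texts(texts, metadatas)` and `invoke(query)` surface, which is roughly what I understand hybrid retrievers to offer. A real adapter would presumably subclass VectorStore and wrap the actual retriever instead.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Stand-in for a document with content and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class FakeHybridRetriever:
    """Stand-in retriever exposing add_texts and invoke (simplified, assumed API)."""

    def __init__(self) -> None:
        self.rows: List[Doc] = []

    def add_texts(self, texts: List[str], metadatas: Optional[List[dict]] = None) -> None:
        metadatas = metadatas or [{} for _ in texts]
        for text_item, meta in zip(texts, metadatas):
            self.rows.append(Doc(text_item, dict(meta)))

    def invoke(self, query: str) -> List[Doc]:
        # Naive keyword match in place of a real dense + sparse query.
        terms = query.lower().split()
        return [d for d in self.rows if any(t in d.page_content.lower() for t in terms)]


class HybridVectorStoreAdapter:
    """Translates the vector-store style calls onto the retriever's methods."""

    def __init__(self, retriever: FakeHybridRetriever) -> None:
        self.retriever = retriever

    def add_documents(self, documents: List[Doc]) -> None:
        self.retriever.add_texts(
            [d.page_content for d in documents],
            metadatas=[d.metadata for d in documents],
        )

    def similarity_search(self, query: str, k: int = 4) -> List[Doc]:
        return self.retriever.invoke(query)[:k]
```

The important part for the multi-vector use case is that the metadata (including the doc_id key) survives the round trip, so the parent documents can still be looked up in the docstore.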

Replies: 0 comments
