metadata filtering with BM25 Retriver #13978

VijayKumaraswamy · 2024-06-06T12:08:12Z

VijayKumaraswamy
Jun 6, 2024

import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-lvksfbvhskvbdksvbksdfvbkfdsvhkb"
openai.api_key = os.environ["OPENAI_API_KEY"]

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

chroma_client = chromadb.HttpClient(host='gffjd', port=8000)

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

loads BAAI/bge-small-en-v1.5

embed_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

from llama_index.core.vector_stores import (
MetadataFilter,
MetadataFilters,
FilterOperator,
)

filters = MetadataFilters(
filters=[
MetadataFilter(
key="SAMSUNG", value="10-10-2019"
),
]
)

from llama_index.core import VectorStoreIndex

chroma_collection = chroma_client.get_collection("bank_mandate_forms_with_company_date_to_keys_summary_in_content")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
storage_context=storage_context, embed_model=embed_model
)

from llama_index.retrievers.bm25 import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=2, filters=filters)

source_nodes = index.as_retriever(similarity_top_k=100, filters=filters).retrieve(' ')
nodes = [x.node for x in source_nodes]
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes)

from llama_index.core.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
[vector_retriever, bm25_retriever],
similarity_top_k=2,
num_queries=4, # set this to 1 to disable query generation
mode="reciprocal_rerank",
use_async=True,
verbose=True,
# query_gen_prompt="...", # we could override the query generation prompt here
)

apply nested async to run in a notebook

import nest_asyncio

nest_asyncio.apply()

To the above code, I am trying to add metadata filters to both retrievers. while doing that i am having this error, can you help me with this?

{
"name": "ValueError",
"message": "Please pass exactly one of index, nodes, or docstore.",
"stack": "---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[13], line 7
5 source_nodes = index.as_retriever(filters=filters).retrieve(' ')
6 nodes = [x.node for x in source_nodes]
----> 7 bm25_retriever = BM25Retriever.from_defaults(nodes=nodes)

File c:\Users\VijayKumaraswamyLagu\Documents\RAG\Gen_AI_env\Lib\site-packages\llama_index\retrievers\bm25\base.py:60, in BM25Retriever.from_defaults(cls, index, nodes, docstore, tokenizer, similarity_top_k, verbose)
48 @classmethod
49 def from_defaults(
50 cls,
(...)
57 ) -> "BM25Retriever":
58 # ensure only one of index, nodes, or docstore is passed
59 if sum(bool(val) for val in [index, nodes, docstore]) != 1:
---> 60 raise ValueError("Please pass exactly one of index, nodes, or docstore.")
62 if index is not None:
63 docstore = index.docstore

ValueError: Please pass exactly one of index, nodes, or docstore."
}

@VijayKumaraswamy · 2024-06-06T12:42:23Z

dosubot[bot]
bot Jun 6, 2024

Hello, @VijayKumaraswamy! I'm here to help you with any issues you're facing. Let's tackle this problem together!

To resolve the "Please pass exactly one of index, nodes, or docstore" error and correctly add metadata filters to both retrievers in your code, you need to ensure that you are passing only one of the required parameters (index, nodes, or docstore) when creating the BM25Retriever. Additionally, you need to add metadata filters to the retrievers.

Here is an updated version of your code with the necessary changes:

import os
import openai
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator
from llama_index.core import VectorStoreIndex
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
import nest_asyncio

# Set up OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-lvksfbvhskvbdksvbksdfvbkfdsvhkb"
openai.api_key = os.environ["OPENAI_API_KEY"]

# Initialize Chroma client
chroma_client = chromadb.HttpClient(host='ab7a73436a8c04c269e8d27e651d3fea-389708969.ap-south-1.elb.amazonaws.com', port=8000)

# Load embedding model
embed_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")

# Initialize OpenAI LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Define metadata filters
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="SAMSUNG", value="10-10-2019"
        ),
    ]
)

# Set up Chroma vector store and storage context
chroma_collection = chroma_client.get_collection("bank_mandate_forms_with_company_date_to_keys_summary_in_content")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create VectorStoreIndex
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, storage_context=storage_context, embed_model=embed_model)

# Create vector retriever with metadata filters
vector_retriever = index.as_retriever(similarity_top_k=2, filters=filters)

# Retrieve nodes with metadata filters
source_nodes = vector_retriever.retrieve(' ')
nodes = [x.node for x in source_nodes]

# Create BM25 retriever with nodes
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes)

# Create QueryFusionRetriever with both retrievers
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=4,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
    # query_gen_prompt="...",  # we could override the query generation prompt here
)

# Apply nested async to run in a notebook
nest_asyncio.apply()

In this code:

Metadata filters are added to the vector_retriever.
Nodes are retrieved using the vector_retriever with the specified filters.
The BM25Retriever is created using the retrieved nodes.
Both retrievers are combined in the QueryFusionRetriever.

This should resolve the error and correctly apply metadata filters to both retrievers ^[1].

To continue talking to Dosu, mention @dosu.

0 replies

VijayKumaraswamy · 2024-06-06T13:30:47Z

VijayKumaraswamy
Jun 6, 2024
Author

Thank you so much for the response.

source_nodes = vector_retriever.retrieve(' ')

what does this line do, can you please elaborate it for me?

1 reply

dosubot[bot] bot Jun 6, 2024

The line

source_nodes = vector_retriever.retrieve(' ')

is calling the retrieve method of the VectorIndexRetriever class with a query string consisting of a single space character. This method retrieves a list of nodes (documents) from the vector store that match the given query.

Here's a detailed breakdown of what happens:

Query Processing: The retrieve method processes the input query string. In this case, the query string is a single space character.
Query Bundle Construction: Internally, the method constructs a QueryBundle from the input query string. If the vector store is configured for embedding queries, it ensures that the query bundle has the necessary embeddings.
Vector Store Query: The method then uses these embeddings to build a VectorStoreQuery, which includes parameters such as similarity_top_k, node_ids, doc_ids, query_str, mode, alpha, filters, and sparse_top_k.
Execution: The VectorStoreQuery is executed against the vector store. The results are processed to build a list of NodeWithScore objects, which represent the retrieved nodes along with their relevance scores.
Handling Empty Queries: If the query string is empty or consists of only whitespace, the VectorIndexAutoRetriever class can handle this by using a default empty query vector if configured. This ensures that even with an empty query, relevant nodes can still be retrieved based on metadata filters and other configurations.

The retrieved nodes are stored in the source_nodes variable, which can then be used for further processing or analysis.

This process ensures that the retrieval is robust and can handle various types of queries, including those with metadata filters ^[1]^[2]^[3]^[4]^[5].

To continue talking to Dosu, mention @dosu.

VijayKumaraswamy · 2024-06-07T04:10:13Z

VijayKumaraswamy
Jun 7, 2024
Author

The updated code is still giving the same error

ValueError Traceback (most recent call last)
Cell In[12], line 11
8 nodes = [x.node for x in source_nodes]
10 # Create BM25 retriever with nodes
---> 11 bm25_retriever = BM25Retriever.from_defaults(nodes=nodes)

File c:\Users\VijayKumaraswamyLagu\Documents\RAG\Gen_AI_env\Lib\site-packages\llama_index\retrievers\bm25\base.py:60, in BM25Retriever.from_defaults(cls, index, nodes, docstore, tokenizer, similarity_top_k, verbose)
48 @classmethod
49 def from_defaults(
50 cls,
(...)
57 ) -> "BM25Retriever":
58 # ensure only one of index, nodes, or docstore is passed
59 if sum(bool(val) for val in [index, nodes, docstore]) != 1:
---> 60 raise ValueError("Please pass exactly one of index, nodes, or docstore.")
62 if index is not None:
63 docstore = index.docstore

ValueError: Please pass exactly one of index, nodes, or docstore.

1 reply

dosubot[bot] bot Jun 7, 2024

To resolve the "Please pass exactly one of index, nodes, or docstore" error, ensure that only one of the parameters (index, nodes, or docstore) is passed to the from_defaults method of the BM25Retriever class. Here is an updated version of your code:

import os
import openai
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator
from llama_index.core import VectorStoreIndex
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
import nest_asyncio

# Set up OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-lvksfbvhskvbdksvbksdfvbkfdsvhkb"
openai.api_key = os.environ["OPENAI_API_KEY"]

# Initialize Chroma client
chroma_client = chromadb.HttpClient(host='ab7a73436a8c04c269e8d27e651d3fea-389708969.ap-south-1.elb.amazonaws.com', port=8000)

# Load embedding model
embed_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")

# Initialize OpenAI LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Define metadata filters
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="SAMSUNG", value="10-10-2019"
        ),
    ]
)

# Set up Chroma vector store and storage context
chroma_collection = chroma_client.get_collection("bank_mandate_forms_with_company_date_to_keys_summary_in_content")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create VectorStoreIndex
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, storage_context=storage_context, embed_model=embed_model)

# Create vector retriever with metadata filters
vector_retriever = index.as_retriever(similarity_top_k=2, filters=filters)

# Retrieve nodes with metadata filters
source_nodes = vector_retriever.retrieve(' ')
nodes = [x.node for x in source_nodes]

# Create BM25 retriever with nodes
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes)

# Create QueryFusionRetriever with both retrievers
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=4,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
    # query_gen_prompt="...",  # we could override the query generation prompt here
)

# Apply nested async to run in a notebook
nest_asyncio.apply()

In this code:

Metadata filters are added to the vector_retriever.
Nodes are retrieved using the vector_retriever with the specified filters.
The BM25Retriever is created using the retrieved nodes.

Ensure that you are not passing multiple parameters (index, nodes, or docstore) to the BM25Retriever.from_defaults method to avoid the error ^[1].

To continue talking to Dosu, mention @dosu.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

metadata filtering with BM25 Retriver #13978

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

metadata filtering with BM25 Retriver #13978

Uh oh!

Uh oh!

VijayKumaraswamy Jun 6, 2024

loads BAAI/bge-small-en-v1.5

apply nested async to run in a notebook

Replies: 3 comments · 2 replies

Uh oh!

dosubot[bot] bot Jun 6, 2024

Uh oh!

VijayKumaraswamy Jun 6, 2024 Author

Uh oh!

dosubot[bot] bot Jun 6, 2024

Uh oh!

VijayKumaraswamy Jun 7, 2024 Author

Uh oh!

dosubot[bot] bot Jun 7, 2024

VijayKumaraswamy
Jun 6, 2024

Replies: 3 comments 2 replies

dosubot[bot]
bot Jun 6, 2024

VijayKumaraswamy
Jun 6, 2024
Author

VijayKumaraswamy
Jun 7, 2024
Author