RAG selects the wrong node. #13862

theochatz96 · 2024-05-31T11:05:52Z

Hi all, the code below is not working properly. I want to build a RAG chatbot that can answer questions about specific drugs. The problem is that some drugs have very similar names (for example, "0.9% Sodium Chloride Intravenous Infusion Solution" vs. "0.9 % w/v Sodium Chloride Injection BP"), and the agent almost always retrieves the wrong document, which leads to incorrect answers. Can anyone help me? Thanks a lot!

from typing import List, Optional

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.bridge.pydantic import BaseModel, Field
from llama_index.core.extractors import PydanticProgramExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.postprocessor import KeywordNodePostprocessor
from llama_index.core.tools import FunctionTool
from llama_index.core.vector_stores import FilterCondition, MetadataFilters
from llama_index.program.openai import OpenAIPydanticProgram


def get_doc_tools(
    file_path: str,
    name: str,
) -> FunctionTool:
    """Get vector query tool from a document."""

    class NodeMetadata(BaseModel):
        """Node Metadata."""

        entities: List[str] = Field(
            ..., description="Unique entities in this text chunk."
        )
        summary: str = Field(
            ..., description="A concise summary of this text chunk."
        )
        medicinal_product: str = Field(
            ..., description="The name of the medicinal product."
        )

    openai_program = OpenAIPydanticProgram.from_defaults(
        output_cls=NodeMetadata,
        prompt_template_str="{input}",
    )

    program_extractor = PydanticProgramExtractor(
        program=openai_program, input_key="input", show_progress=True
    )

    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    splitter = SentenceSplitter(chunk_size=1500, chunk_overlap=250)

    # Split the documents and extract metadata once, then index the nodes
    # directly so the extractor is not run a second time.
    pipeline = IngestionPipeline(
        transformations=[splitter, program_extractor]
    )
    nodes = pipeline.run(
        documents=documents,
        in_place=True,
        show_progress=True,
    )
    vector_index = VectorStoreIndex(nodes)

    def vector_query(
        query: str, medicinal_product: Optional[List[str]] = None
    ) -> str:
        """Use to answer questions over a given paper.

        Useful if you have specific questions over the paper.
        Always leave medicinal_product as None UNLESS there is a specific
        product you want to search for.

        Args:
            query (str): the string query to be embedded.
            medicinal_product (Optional[List[str]]): Filter by set of medicinal
                product names. Leave as None if it is not specified in the
                query. Otherwise, filter by the set of specified products.
        """
        medicinal_product = medicinal_product or []
        # The filter key must match the NodeMetadata field name exactly.
        metadata_dicts = [
            {"key": "medicinal_product", "value": p} for p in medicinal_product
        ]
        # Created but not yet passed to the query engine (see the commented-out
        # node_postprocessors argument below).
        medicinal_product_postprocessor = KeywordNodePostprocessor(
            required_keywords=medicinal_product
        )
        query_engine = vector_index.as_query_engine(
            similarity_top_k=10,
            # node_postprocessors=[my_reranker],
            filters=MetadataFilters.from_dicts(
                metadata_dicts,
                condition=FilterCondition.OR,
            ),
        )
        response = query_engine.query(query)

        # The tool is annotated as returning str, so convert the Response object.
        return str(response)

    vector_query_tool = FunctionTool.from_defaults(
        name=f"vector_tool_{name}",
        fn=vector_query,
    )
    return vector_query_tool

from pathlib import Path

from llama_index.core import VectorStoreIndex
from llama_index.core.agent import AgentRunner, FunctionCallingAgentWorker
from llama_index.core.objects import ObjectIndex

# `papers` (a list of file paths) and `llm` are assumed to be defined earlier.
paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool]

all_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

print(len(all_tools))

# Index the tools themselves so the agent can retrieve the relevant ones.
obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)

obj_retriever = obj_index.as_retriever(similarity_top_k=5)

tools = obj_retriever.retrieve(
    "What is the shelf life of the medicinal product: '0.18%w/v Sodium Chloride and 4.0% w/v Glucose Intravenous Infusion BP'?"
)

agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_retriever,
    llm=llm,
    system_prompt="""
You are an agent designed to answer queries over a set of given papers.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.

When answering a query, ensure that the 'medicinal_product' metadata in the node matches the medicinal product mentioned in the user's query.
If there is a match, proceed to provide the relevant information from the node.
If there is no match, indicate that the information about the requested medicinal product is not available in the provided papers AND look at another node that matches the requirement.
""",
    verbose=True,
)
agent = AgentRunner(agent_worker)

response = agent.query(
    "What is the shelf life of the medicinal product: '0.18%w/v Sodium Chloride and 4.0% w/v Glucose Intravenous Infusion BP'?"
)
print(str(response))
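
A related point on the metadata filter in `vector_query`: with the default equality operator, a filter value has to agree with the name stored on the node character for character. With names such as "0.9% Sodium Chloride ..." and "0.9 % w/v Sodium Chloride ...", spacing and casing differences alone can make a filter miss. Below is a minimal plain-Python sketch of normalizing names before an exact comparison. It makes no llama_index calls; `normalize_product_name`, `filter_nodes`, and the toy node dictionaries are hypothetical and only illustrate the idea.

```python
import re
from typing import Dict, List, Optional


def normalize_product_name(name: str) -> str:
    """Lowercase, remove whitespace before '%', and collapse repeated whitespace."""
    name = name.strip().lower()
    name = re.sub(r"\s+%", "%", name)  # "0.9 %" becomes "0.9%"
    name = re.sub(r"\s+", " ", name)
    return name


def filter_nodes(
    nodes: List[Dict[str, str]], products: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Keep only nodes whose normalized product name is in `products`.

    With no products given, all nodes are returned unfiltered.
    """
    if not products:
        return list(nodes)
    wanted = {normalize_product_name(p) for p in products}
    return [
        n for n in nodes
        if normalize_product_name(n["medicinal_product"]) in wanted
    ]


# Toy nodes (hypothetical): only the metadata field matters here.
nodes = [
    {
        "medicinal_product": "0.9% Sodium Chloride Intravenous Infusion Solution",
        "text": "chunk from the infusion solution document",
    },
    {
        "medicinal_product": "0.9 % w/v Sodium Chloride Injection BP",
        "text": "chunk from the injection document",
    },
]

# The query spells the name with different spacing and casing.
matches = filter_nodes(nodes, ["0.9 %  W/V sodium chloride injection bp"])
for node in matches:
    print(node["medicinal_product"])
```

In a real pipeline the same normalization would have to be applied both when the metadata is written to the nodes and when the filter value is built from the user's query.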

@theochatz96 · 2024-05-31T11:09:04Z

dosubot[bot]
bot May 31, 2024

Hello, @theochatz96! I'm here to assist you with any bugs, questions, or contributions you might have. Feel free to ask for help while waiting for a human maintainer.

To improve the accuracy of document selection for similar drug names in your RAG chatbot, you can use the SemanticSimilarityEvaluator class from the llama_index library. This class compares the embedding of a generated answer with the embedding of a reference answer, so you can check whether the answers your pipeline produces match the ones you expect.

Additionally, you can leverage the CorrectiveRAGPack to further enhance the evaluation process. This pack allows you to refine queries and evaluate the relevancy of retrieved documents effectively.

Here's how you can integrate these tools into your existing code:

1. Initialize the SemanticSimilarityEvaluator: Create an instance of the SemanticSimilarityEvaluator with the appropriate settings and similarity threshold.
2. Evaluate the Similarity: Use the aevaluate method to check if the response is similar to the reference answer.
3. Use CorrectiveRAGPack: Download and use the CorrectiveRAGPack to refine queries and evaluate document relevancy.

Here is an example code snippet to demonstrate this:

import asyncio

from llama_index.core import VectorStoreIndex
from llama_index.core.evaluation import SemanticSimilarityEvaluator
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.packs.corrective_rag.base import CorrectiveRAGPack


async def main():
    # Initialize the SemanticSimilarityEvaluator with the default embedding
    # model; 0.8 is the default threshold, written out here for clarity.
    evaluator = SemanticSimilarityEvaluator(similarity_threshold=0.8)

    # Define your query, response, and reference
    query = "What are the side effects of DrugX?"
    response = "DrugX can cause nausea and dizziness."
    reference = "Common side effects of DrugX include nausea and dizziness."

    # Evaluate the similarity
    evaluation_result = await evaluator.aevaluate(response=response, reference=reference)

    # Check the result
    if evaluation_result.passing:
        print("The response is similar to the reference.")
    else:
        print("The response is not similar to the reference.")

    # DOWNLOAD LLAMADATASET
    rag_dataset, documents = download_llama_dataset("MiniCovidQaDataset", "./data")

    # BUILD BASIC RAG PIPELINE
    index = VectorStoreIndex.from_documents(documents=documents)
    query_engine = index.as_query_engine()

    # Initialize CorrectiveRAGPack
    corrective_rag_pack = CorrectiveRAGPack(documents=documents, tavily_ai_apikey="your_tavily_api_key")

    # Run the CorrectiveRAGPack pipeline
    result = corrective_rag_pack.run(query_str=query)
    print(result)


if __name__ == "__main__":
    # asyncio.run creates and closes the event loop for us.
    asyncio.run(main())

This approach lets you check whether the answers returned for similar drug names match your reference answers. The SemanticSimilarityEvaluator scores the embedding similarity between the response and the reference and passes when the score reaches the threshold, which defaults to 0.8 and can be adjusted to your needs. The CorrectiveRAGPack additionally refines queries and evaluates the relevancy of retrieved documents, which can further improve your RAG pipeline ^[1]^[2]^[3]^[4]^[5].
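
To make the pass/fail decision concrete, here is a minimal plain-Python sketch of a thresholded similarity check, assuming cosine similarity as the similarity function. It is not the library's implementation; `cosine_similarity`, `passes`, and the two-dimensional toy vectors are hypothetical stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def passes(
    response_vec: Sequence[float],
    reference_vec: Sequence[float],
    threshold: float = 0.8,
) -> bool:
    """Pass when the similarity score reaches the threshold."""
    return cosine_similarity(response_vec, reference_vec) >= threshold


# Toy embeddings (hypothetical): identical vectors vs. orthogonal vectors.
same_direction = passes([1.0, 0.0], [1.0, 0.0])
orthogonal = passes([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Raising the threshold makes the check stricter, which may matter when two product descriptions differ only in a few tokens and their embeddings are therefore close to each other.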

To continue talking to Dosu, mention @dosu.
