Chat bot with Pandas Dataframe Agent - Need exact values while filtering the values. How I mitigate the problem? #26012

mraguth · 2024-09-04T04:27:56Z

mraguth
Sep 4, 2024

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

from langchain_experimental.agents import create_pandas_dataframe_agent
import os
import pandas as pd
agent = create_pandas_dataframe_agent(
    llm,
    df,
    agent_type="openai-tools",
    verbose=True,
    return_intermediate_steps=True,
    allow_dangerous_code=True,
    allow_dangerous_requests=True,
)
agent.invoke({"input":query_str})

Description

I created a chatbot using the create_pandas_dataframe_agent function. Under the hood, I provided a pandas DataFrame, and the pandas agent performs well as long as users ask general questions without requiring filters. For example, a filter question might be: "What is the salary of Bob?" Here, "Bob" is a value in the 'Employee Name' column.

The issue arises because my end users may not know the exact name format. Sometimes they type "bob," "Bob," or "BOB," leading to the agent responding with "Not found in the DataFrame."

My question is: Can we convert the pandas DataFrame to an unstructured format and use Retrieval-Augmented Generation (RAG) on top of it? Would this provide accurate results? If I convert it to RAG, will it handle computations well if users ask questions based on aggregations?

This is my sample dataset. Is converting it to an unstructured format a viable solution? I tried converting it to JSON format but didn't have much success. I am open to suggestions. Some of my pandas DataFrames have 70 to 90K rows and 10 columns. I would appreciate any suggestions or sample code for a proof of concept (POC).

System Info

Name: langchain
Version: 0.2.15

@dosu · 2024-09-04T04:29:47Z

dosubot[bot]
bot Sep 4, 2024

I found a similar unsolved discussion that might be relevant to your issue: create_pandas_dataframe_agent ^[1].

To handle case-insensitive filtering in a chatbot created with the create_pandas_dataframe_agent function, you can convert the text to lowercase before performing the filtering. This ensures that the comparison is case-insensitive. Here is an example:

import pandas as pd

# Sample DataFrame
data = {'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'], 'Salary': [50000, 60000, 70000, 55000]}
df = pd.DataFrame(data)

# Case-insensitive filtering
search_name = 'bob'
filtered_df = df[df['Employee Name'].str.lower() == search_name.lower()]

print(filtered_df)

Regarding the use of Retrieval-Augmented Generation (RAG) for accurate results and handling computations based on aggregations, RAG is primarily designed for unstructured data and may not be the best fit for structured data like DataFrames where precise computations and aggregations are required.

For accurate results and handling computations based on aggregations, sticking with structured data processing techniques within pandas or similar libraries is recommended over converting the DataFrame to an unstructured format using RAG ^[2].

If you need to handle large datasets (e.g., 70 to 90K rows) efficiently, LangChain's RAG framework involves several components such as document loaders, text splitters, embedding models, and vector stores. Properly configuring these components can help manage large datasets effectively ^[3]^[2].

To continue talking to Dosu, mention @dosu.

4 replies

mraguth Sep 4, 2024
Author

@dosu... I agree with your suggestion. Since I am getting column level unique values present in the dataframe vs. what my user is asking, is there way I can match 'human' asks vs. what I have in the table? By creating unique values of each column --> look for human asks --> match with unique values in the each column --> send to query_str --> send to llm?..... is that possible via code?

dosubot[bot] bot Sep 4, 2024

Yes, you can match user queries with unique column values in a DataFrame by creating unique values for each column, comparing them with user inputs, and then sending the matched values to query_str for the LLM. Here is a method to handle this using LangChain:

Create an index of unique values: Use a vector store to create embeddings for the unique values in your DataFrame columns.
Retrieve the most relevant values: Use a retriever to find the most relevant values based on the user query.
Generate the query: Use the matched values to generate a relevant search query for the LLM.

Here is a code snippet to demonstrate this approach:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
import pandas as pd

# Sample DataFrame
data = {'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'], 'Salary': [50000, 60000, 70000, 55000]}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = df['Employee Name'].unique().tolist()

# Create embeddings for the unique values
vector_db = FAISS.from_texts(unique_values, OpenAIEmbeddings())
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search
def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)

# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

query_chain = create_sql_query_chain(llm, db, prompt=prompt)
retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"question": "What is the salary of Bob?"})
print(query)
db.run(query)

This code creates an index of unique values and uses a similarity search to find the most relevant values based on the user query. The matched values are then used to generate a relevant search query for the LLM ^[1]^[2]^[3].

To continue talking to Dosu, mention @dosu.

mraguth Sep 4, 2024
Author

@Looks promising. What about my dataframe has multiple columns? Like 'Employee Name', 'City', 'State' etc? can you adjust the code please? I will try asap. Thank you.

dosubot[bot] bot Sep 4, 2024

To handle a DataFrame with multiple columns like 'Employee Name', 'City', and 'State', you can adjust the code to create unique values for each column and use a vector store to match user queries with these values. Here is an updated version of the code:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
import pandas as pd

# Sample DataFrame
data = {
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'State': ['NY', 'CA', 'IL', 'NY'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = {col: df[col].unique().tolist() for col in df.columns}

# Flatten the unique values for embedding
flattened_values = [f"{col}: {val}" for col, vals in unique_values.items() for val in vals]

# Create embeddings for the unique values
vector_db = FAISS.from_texts(flattened_values, OpenAIEmbeddings())
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search
def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)

# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

query_chain = create_sql_query_chain(llm, db, prompt=prompt)
retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"question": "What is the salary of Bob in New York?"})
print(query)
db.run(query)

This code handles multiple columns by creating unique values for each column, embedding them, and using a similarity search to find the most relevant values based on the user query. The matched values are then used to generate a relevant search query for the LLM ^[1]^[2].

To continue talking to Dosu, mention @dosu.

mraguth · 2024-09-04T12:31:39Z

mraguth
Sep 4, 2024
Author

@dosu I am getting this error NameError: name 'create_sql_query_chain' is not defined from above.

5 replies

dosubot[bot] bot Sep 4, 2024

To resolve the NameError: name 'create_sql_query_chain' is not defined error, you need to ensure that create_sql_query_chain is properly defined or imported in your code. The correct function to create a SQL query chain in LangChain is SQLDatabaseChain.from_llm.

Here is an updated version of your code using SQLDatabaseChain.from_llm:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
import pandas as pd

# Sample DataFrame
data = {
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'State': ['NY', 'CA', 'IL', 'NY'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = {col: df[col].unique().tolist() for col in df.columns}

# Flatten the unique values for embedding
flattened_values = [f"{col}: {val}" for col, vals in unique_values.items() for val in vals]

# Create embeddings for the unique values
vector_db = FAISS.from_texts(flattened_values, OpenAIEmbeddings())
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search
def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)

# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

# Initialize the SQLDatabase object with your database connection details
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create an instance of SQLDatabaseChain using the OpenAI language model and the database connection
query_chain = SQLDatabaseChain.from_llm(llm, db, prompt=prompt)

retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"question": "What is the salary of Bob in New York?"})
print(query)
db.run(query)

This code uses SQLDatabaseChain.from_llm to create the SQL query chain, ensuring that the function is properly defined and imported ^[1]^[2].

To continue talking to Dosu, mention @dosu.

mraguth Sep 4, 2024
Author

@dosu... from above I am getting NameError: name 'itemgetter' is not defined

dosubot[bot] bot Sep 4, 2024

To resolve the NameError: name 'itemgetter' is not defined error, you need to import itemgetter from the operator module. Here is the updated code with the necessary import:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter  # Import itemgetter
import pandas as pd

# Sample DataFrame
data = {
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'State': ['NY', 'CA', 'IL', 'NY'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = {col: df[col].unique().tolist() for col in df.columns}

# Flatten the unique values for embedding
flattened_values = [f"{col}: {val}" for col, vals in unique_values.items() for val in vals]

# Create embeddings for the unique values
vector_db = FAISS.from_texts(flattened_values, OpenAIEmbeddings())
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search
def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)

# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

# Initialize the SQLDatabase object with your database connection details
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create an instance of SQLDatabaseChain using the OpenAI language model and the database connection
query_chain = SQLDatabaseChain.from_llm(llm, db, prompt=prompt)

retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"question": "What is the salary of Bob in New York?"})
print(query)
db.run(query)

This should resolve the NameError and allow your code to run correctly.

To continue talking to Dosu, mention @dosu.

mraguth Sep 4, 2024
Author

@dosu I am getting this error ValueError: Missing some input keys: {'query'} Output is truncated. View as a [scrollable element](command:cellOutput.enableScrolling?ef7b2490-ac67-4e9a-b5f8-787d57c638da) or open in a [text editor](command:workbench.action.openLargeOutput?ef7b2490-ac67-4e9a-b5f8-787d57c638da). Adjust cell output [settings](command:workbench.action.openSettings?%5B%22%40tag%3AnotebookOutputLayout%22%5D)... from query = chain.invoke({"question": "What is the salary of Bob in New York?"})

Same time, where I can use my pandas agent this below??

agent = create_pandas_dataframe_agent(
    llm,
    df,
    agent_type="openai-tools",
    verbose=True,
    return_intermediate_steps=True,
    allow_dangerous_code=True,
    allow_dangerous_requests=True,
)
agent.invoke({"input":query_str})

dosubot[bot] bot Sep 4, 2024

To resolve the ValueError: Missing some input keys: {'query'} error and integrate the pandas agent using create_pandas_dataframe_agent, you need to ensure that the input to the chain.invoke method includes the required key. Additionally, you need to set up the pandas agent correctly. Here is how you can do it:

Ensure the input to chain.invoke includes the required key.
Integrate the pandas agent using create_pandas_dataframe_agent.

Here is the updated code:

from pathlib import Path
import pandas as pd
from langchain.agents import AgentExecutor, OpenAIFunctionsAgent
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.tools.retriever import create_retriever_tool
from langchain_experimental.tools import PythonAstREPLTool

MAIN_DIR = Path(__file__).parents[1]

pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 20)

embedding_model = OpenAIEmbeddings()
vectorstore = FAISS.load_local(MAIN_DIR / "titanic_data", embedding_model)
retriever_tool = create_retriever_tool(
    vectorstore.as_retriever(), "person_name_search", "Search for a person by name"
)

TEMPLATE = """You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
It is important to understand the attributes of the dataframe before working with it. This is the result of running `df.head().to_markdown()`

<df>
{dhead}
</df>

You are not meant to use only these rows to answer questions - they are meant as a way of telling you about the shape and schema of the dataframe.
You also do not have use only the information here to answer questions - you can run intermediate queries to do exporatory data analysis to give you more information as needed.

You have a tool called `person_name_search` through which you can lookup a person by name and find the records corresponding to people with similar name as the query.
You should only really use this if your search term contains a persons name. Otherwise, try to solve it with code.

For example:

<question>How old is Jane?</question>
<logic>Use `person_name_search` since you can use the query `Jane`</logic>
<question>Who has id 320</question>
<logic>Use `python_repl` since even though the question is about a person, you don't know their name so you can't include it.</logic>
"""  # noqa: E501

class PythonInputs(BaseModel):
    query: str = Field(description="code snippet to run")

df = pd.read_csv(MAIN_DIR / "titanic.csv")
template = TEMPLATE.format(dhead=df.head().to_markdown())

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
        ("human", "{input}"),
    ]
)

repl = PythonAstREPLTool(
    locals={"df": df},
    name="python_repl",
    description="Runs code and returns the output of the final line",
    args_schema=PythonInputs,
)
tools = [repl, retriever_tool]
agent = OpenAIFunctionsAgent(
    llm=ChatOpenAI(temperature=0, model="gpt-4"), prompt=prompt, tools=tools
)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, max_iterations=5, early_stopping_method="generate"
) | (lambda x: x["output"])

# Typing for playground inputs
class AgentInputs(BaseModel):
    input: str

agent_executor = agent_executor.with_types(input_type=AgentInputs)

# Create the pandas dataframe agent
from langchain.agents import create_pandas_dataframe_agent

pandas_agent = create_pandas_dataframe_agent(df, agent_executor)

# Invoke the chain with the correct input key
query = {"input": "What is the salary of Bob in New York?"}
result = pandas_agent.invoke(query)
print(result)

This code ensures that the input to chain.invoke includes the required key (input) and integrates the pandas agent using create_pandas_dataframe_agent ^[1]^[2].

To continue talking to Dosu, mention @dosu.

mraguth · 2024-09-04T13:12:11Z

mraguth
Sep 4, 2024
Author

@dosu I like to fix this error ValueError: Missing some input keys: {'query'} coming from your below code. I don't need new code:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter  # Import itemgetter
import pandas as pd

# Sample DataFrame
data = {
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'State': ['NY', 'CA', 'IL', 'NY'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = {col: df[col].unique().tolist() for col in df.columns}

# Flatten the unique values for embedding
flattened_values = [f"{col}: {val}" for col,
                    vals in unique_values.items() for val in vals]

# Create embeddings for the unique values
vector_db = FAISS.from_texts(flattened_values, embeddings)
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search


def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)


# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages(
    [("system", system), ("human", "{input}")])

# Initialize the SQLDatabase object with your database connection details
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = llm

# Create an instance of SQLDatabaseChain using the OpenAI language model and the database connection
query_chain = SQLDatabaseChain.from_llm(llm, db, prompt=prompt)

retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
#query = chain.invoke({"question": "What is the salary of Bob in New York?"})
query = chain.invoke({"question": "What is the salary of Bob in New York?"})
print(query)
db.run(query)

1 reply

dosubot[bot] bot Sep 4, 2024

To fix the ValueError: Missing some input keys: {'query'} error, you need to ensure that the input dictionary passed to chain.invoke includes the required key query. The SQLDatabaseChain expects the key query instead of question.

Here is the corrected code:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter  # Import itemgetter
import pandas as pd

# Sample DataFrame
data = {
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'State': ['NY', 'CA', 'IL', 'NY'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = {col: df[col].unique().tolist() for col in df.columns}

# Flatten the unique values for embedding
flattened_values = [f"{col}: {val}" for col, vals in unique_values.items() for val in vals]

# Create embeddings for the unique values
embeddings = OpenAIEmbeddings()  # Ensure you have an instance of OpenAIEmbeddings
vector_db = FAISS.from_texts(flattened_values, embeddings)
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search
def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)

# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

# Initialize the SQLDatabase object with your database connection details
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create an instance of SQLDatabaseChain using the OpenAI language model and the database connection
query_chain = SQLDatabaseChain.from_llm(llm, db, prompt=prompt)

retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"query": "What is the salary of Bob in New York?"})
print(query)
db.run(query)

Make sure to replace {"question": "What is the salary of Bob in New York?"} with {"query": "What is the salary of Bob in New York?"} when invoking the chain. This should resolve the ValueError ^[1].

To continue talking to Dosu, mention @dosu.

mraguth · 2024-09-04T16:54:56Z

mraguth
Sep 4, 2024
Author

@dosu, how I replace this code with my pandas data frame agent invoke?

retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
#query = chain.invoke({"question": "What is the salary of Bob in New York?"})
query = chain.invoke({"question": "What is the salary of Bob in New York?"})
print(query)
db.run(query)

agent = create_pandas_dataframe_agent(
    llm,
    df,
    agent_type="openai-tools",
    verbose=True,
    return_intermediate_steps=True,
    allow_dangerous_code=True,
    allow_dangerous_requests=True,
)
agent.invoke({"input":question})

1 reply

dosubot[bot] bot Sep 4, 2024

To replace the code using chain.invoke with your pandas DataFrame agent invoke, you can use the create_pandas_dataframe_agent function. Here is how you can modify your code:

Original code:

retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"question": "What is the salary of Bob in New York?"})
print(query)
db.run(query)

Modified code using create_pandas_dataframe_agent:

agent = create_pandas_dataframe_agent(
    llm,
    df,
    agent_type="openai-tools",
    verbose=True,
    return_intermediate_steps=True,
    allow_dangerous_code=True,
    allow_dangerous_requests=True,
)

# Example usage
query = agent.invoke({"input": "What is the salary of Bob in New York?"})
print(query)
db.run(query)

This modification uses the create_pandas_dataframe_agent to create an agent that can handle the DataFrame and then invokes it with the question. The create_pandas_dataframe_agent function constructs a Pandas agent from a language model and DataFrame(s), allowing for the execution of arbitrary code in a sandboxed environment ^[1]^[2].

To continue talking to Dosu, mention @dosu.

mraguth · 2024-09-04T18:04:43Z

mraguth
Sep 4, 2024
Author

@dosu this is not working example.... it is very sad to say.... do you have complete code snippet of any working example? picking up code one pace and dupming as a user's answer is not working unfortunately :-(

1 reply

dosubot[bot] bot Sep 4, 2024

Here is a complete code snippet of a working example using create_pandas_dataframe_agent with a pandas DataFrame and handling user queries:

from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_openai import ChatOpenAI
import pandas as pd

# Load your DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")

# Initialize the language model
llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)

# Create the agent executor using the OPENAI_FUNCTIONS agent type
agent_executor = create_pandas_dataframe_agent(
    llm,
    df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True
)

# Example of handling a user query
query = "What is the average age of passengers?"
response = agent_executor.invoke(query)
print(response)

# Additional examples of handling user queries
print(agent_executor.invoke("how many rows are there?"))
print(agent_executor.invoke("how many people have more than 3 siblings"))
print(agent_executor.invoke("whats the square root of the average age?"))

This code snippet demonstrates how to use create_pandas_dataframe_agent with a pandas DataFrame and handle user queries. It uses the OPENAI_FUNCTIONS agent type for initialization and includes additional examples of handling user queries. Note that using a library like Pandas requires letting the model execute Python code, which carries significant security risks. For production use cases, consider using SQL to interact with CSV data as it is easier to limit permissions and sanitize queries ^[1]^[2]^[3]^[4].

This is a tough nut to crack, could really use your help on this, @eyurtsev.

To continue talking to Dosu, mention @dosu.

mraguth · 2024-09-04T18:11:36Z

mraguth
Sep 4, 2024
Author

@dosu I am looking for example for this context...

Yes, you can match user queries with unique column values in a DataFrame (with multiple columns) by creating unique values for each column, comparing them with user inputs, and then sending the matched values to query_str for the LLM. Here is a method to handle this using LangChain:

Create an index of unique values: Use a vector store to create embeddings for the unique values in your DataFrame columns.
Retrieve the most relevant values: Use a retriever to find the most relevant values based on the user query.
Generate the query: Use the matched values to generate a relevant search query for the LLM.

7 replies

dosubot[bot] bot Sep 4, 2024

To resolve the ValueError: Missing some input keys: {'proper_nouns'} error, you need to ensure that the proper_nouns key is correctly provided in the input dictionary. Here is the updated code to include the proper_nouns key:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
import pandas as pd

# Sample DataFrame
data = {
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'State': ['NY', 'CA', 'IL', 'NY'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = {col: df[col].unique().tolist() for col in df.columns}

# Flatten the unique values for embedding
flattened_values = [f"{col}: {val}" for col, vals in unique_values.items() for val in vals]

# Create embeddings for the unique values
embeddings = OpenAIEmbeddings()  # Ensure you have an instance of OpenAIEmbeddings
vector_db = FAISS.from_texts(flattened_values, embeddings)
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search
def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)

# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

# Initialize the SQLDatabase object with your database connection details
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create an instance of SQLDatabaseChain using the OpenAI language model and the database connection
query_chain = SQLDatabaseChain.from_llm(llm, db, prompt=prompt)

retriever_chain = (
    itemgetter("query")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"query": "What is the salary of Bob in New York?", "proper_nouns": select_values("What is the salary of Bob in New York?")})
print(query)
db.run(query)

In this code:

The select_values function is used to retrieve the most relevant values based on the user query.
The proper_nouns key is included in the input dictionary passed to chain.invoke.

This should resolve the ValueError and ensure that the proper_nouns key is correctly provided ^[1].

To continue talking to Dosu, mention @dosu.

mraguth Sep 4, 2024
Author

@dosu Error again "AttributeError: 'VectorStoreRetriever' object has no attribute 'retrieve'"

dosubot[bot] bot Sep 4, 2024

To resolve the "AttributeError: 'VectorStoreRetriever' object has no attribute 'retrieve'" error, you should ensure that you are using the correct method provided by the VectorStoreRetriever class. According to the LangChain documentation, the VectorStoreRetriever class does not have a retrieve method. Instead, you should use the retrieve method provided by the VectorStoreRetriever object.

Here is an example of how you can use the VectorStoreRetriever correctly:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
import pandas as pd

# Sample DataFrame
data = {
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'State': ['NY', 'CA', 'IL', 'NY'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = {col: df[col].unique().tolist() for col in df.columns}

# Flatten the unique values for embedding
flattened_values = [f"{col}: {val}" for col, vals in unique_values.items() for val in vals]

# Create embeddings for the unique values
embeddings = OpenAIEmbeddings()  # Ensure you have an instance of OpenAIEmbeddings
vector_db = FAISS.from_texts(flattened_values, embeddings)
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search
def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)

# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

# Initialize the SQLDatabase object with your database connection details
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create an instance of SQLDatabaseChain using the OpenAI language model and the database connection
query_chain = SQLDatabaseChain.from_llm(llm, db, prompt=prompt)

retriever_chain = (
    itemgetter("query")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"query": "What is the salary of Bob in New York?", "proper_nouns": select_values("What is the salary of Bob in New York?")})
print(query)
db.run(query)

In this code:

The retrieve method is correctly used on the retriever object to get the relevant documents.
The proper_nouns key is included in the input dictionary passed to chain.invoke.

This should resolve the AttributeError and ensure that the retrieve method is used correctly ^[1]^[2]^[3].

To continue talking to Dosu, mention @dosu.

mraguth Sep 4, 2024
Author

@error again:

Cell In[156], [line 36](vscode-notebook-cell:?execution_count=156&line=36)
     [35](vscode-notebook-cell:?execution_count=156&line=35) def select_values(question):
---> [36](vscode-notebook-cell:?execution_count=156&line=36)     docs = retriever.retrieve(question)
     [37](vscode-notebook-cell:?execution_count=156&line=37)     values = [doc.page_content for doc in docs]
     [38](vscode-notebook-cell:?execution_count=156&line=38)     return ", ".join(values)

AttributeError: 'VectorStoreRetriever' object has no attribute 'retrieve'

dosubot[bot] bot Sep 4, 2024

To resolve the AttributeError: 'VectorStoreRetriever' object has no attribute 'retrieve' error, you should use the correct method provided by the VectorStoreRetriever class. The correct method to retrieve documents is retrieve.

Here is the updated code:

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
import pandas as pd

# Sample DataFrame
data = {
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'alice'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'State': ['NY', 'CA', 'IL', 'NY'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Create unique values for each column
unique_values = {col: df[col].unique().tolist() for col in df.columns}

# Flatten the unique values for embedding
flattened_values = [f"{col}: {val}" for col, vals in unique_values.items() for val in vals]

# Create embeddings for the unique values
embeddings = OpenAIEmbeddings()  # Ensure you have an instance of OpenAIEmbeddings
vector_db = FAISS.from_texts(flattened_values, embeddings)
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

# Function to select values based on similarity search
def select_values(question):
    docs = retriever.retrieve(question)
    values = [doc.page_content for doc in docs]
    return ", ".join(values)

# Create the prompt with selected values
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

# Initialize the SQLDatabase object with your database connection details
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create an instance of SQLDatabaseChain using the OpenAI language model and the database connection
query_chain = SQLDatabaseChain.from_llm(llm, db, prompt=prompt)

retriever_chain = (
    itemgetter("query")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

# Example usage
query = chain.invoke({"query": "What is the salary of Bob in New York?", "proper_nouns": select_values("What is the salary of Bob in New York?")})
print(query)
db.run(query)

This code ensures that the retrieve method is used correctly on the retriever object to get the relevant documents ^[1]^[2].

To continue talking to Dosu, mention @dosu.

mraguth · 2024-09-04T23:17:01Z

mraguth
Sep 4, 2024
Author

@dosu... I am reading this article https://python.langchain.com/v0.2/docs/how_to/query_high_cardinality/#query-analysis
How deal with high cardinality categoricals when doing query analysis - very facinating it is finding all relevant values and replacing with right values at the end.... Is that possible to apply the similar concept to my pandas dataframe? can you help me in the sample code please?

0 replies

Chat bot with Pandas Dataframe Agent - Need exact values while filtering the values. How I mitigate the problem? #26012

Uh oh!

mraguth Sep 4, 2024

Checked other resources

Commit to Help

Example Code

Description

System Info

Replies: 7 comments · 19 replies

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

Uh oh!

dosubot[bot] bot Sep 4, 2024

Uh oh!

mraguth Sep 4, 2024 Author

mraguth
Sep 4, 2024

Replies: 7 comments 19 replies

dosubot[bot]
bot Sep 4, 2024

mraguth Sep 4, 2024
Author

mraguth Sep 4, 2024
Author

mraguth
Sep 4, 2024
Author

mraguth Sep 4, 2024
Author

mraguth Sep 4, 2024
Author

mraguth
Sep 4, 2024
Author

mraguth
Sep 4, 2024
Author

mraguth
Sep 4, 2024
Author

mraguth
Sep 4, 2024
Author

mraguth Sep 4, 2024
Author

mraguth Sep 4, 2024
Author

mraguth
Sep 4, 2024
Author