🤖 Hi there,

From your code, it seems like you're storing the user's data into Pinecone correctly. However, when LangChain retrieves documents from Pinecone, it expects each document's metadata to contain a `text` key (the vector store's default `text_key`), and it skips any record that doesn't have one. So, when you're creating your documents from your custom data, make sure to include a `text` key in the metadata:

```python
def embedding_db(user_id, content):
    docs_split = doc_preprocessing(content)
    # Extract text from the split documents
    texts = [doc.page_content for doc in docs_split]
    vectors = embeddings.embed_documents(texts)
    # Store vectors with user_id and the chunk text as metadata
    for i, vector in enumerate(vectors):
        upsert_response = index.upsert(
            vectors=[
                {
                    # Use a unique id per chunk; reusing the same id
                    # for every chunk would overwrite earlier vectors
                    'id': f"{user_id}-{i}",
                    'values': vector,
                    'metadata': {"user_id": str(user_id), "text": texts[i]}
                }
            ]
        )
```

In this modified function, each chunk's text is stored under the `text` metadata key, which is what LangChain's Pinecone vector store reads back by default (you can also pass a custom `text_key` when constructing the store).
I hope this helps! Let me know if you have any other questions.
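One quick way to sanity-check this before upserting is to build the record list locally and confirm every record has a unique id and a `text` metadata field. This is a minimal sketch; `make_records` is a hypothetical helper for illustration, not part of LangChain or the Pinecone client:

```python
def make_records(user_id, texts, vectors):
    """Build Pinecone upsert records, one per chunk.

    Each record gets a unique id (user_id plus chunk index) and
    stores the chunk text under the 'text' metadata key, which is
    the default text_key LangChain's Pinecone store reads back.
    """
    return [
        {
            "id": f"{user_id}-{i}",
            "values": vector,
            "metadata": {"user_id": str(user_id), "text": text},
        }
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]

# Two chunks for user "u42" produce two records with distinct ids.
records = make_records("u42", ["hello", "world"], [[0.1, 0.2], [0.3, 0.4]])
```

Records shaped like this can then be passed to `index.upsert(vectors=records)` in batches instead of one at a time.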
I looked through a lot of documentation but got confused on the retriever part.
I am building a chatbot over a user's custom data.
I am currently following the approach below.
This way it should create embeddings for the given data in Pinecone.
The second part is to chat with this data. For QA, I have the following,
but I keep getting:

```
Found document with no `text` key. Skipping.
```
When I do QA, it's not referring to the data stored in Pinecone; it just answers like plain ChatGPT. I am not sure what I am missing here. Any help would be appreciated.
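That warning is the key symptom: when LangChain's Pinecone store converts query matches into documents, it reads the configured `text_key` (default `"text"`) from each match's metadata and skips any match that lacks it. If every record was stored under a different key (for example `"context"`), the retriever returns nothing and QA falls back to the bare LLM. A simplified re-implementation of that skipping logic, assuming matches shaped like Pinecone query results:

```python
def matches_to_texts(matches, text_key="text"):
    """Mimic how LangChain's Pinecone store turns query matches into
    documents: read text_key from each match's metadata and skip
    (with a warning) any match that doesn't have it."""
    texts = []
    for match in matches:
        metadata = match.get("metadata", {})
        if text_key in metadata:
            texts.append(metadata[text_key])
        else:
            print(f"Found document with no `{text_key}` key. Skipping.")
    return texts

# A record stored under "context" instead of "text" is skipped, so the
# retriever has no context to hand the QA chain.
docs = matches_to_texts([
    {"metadata": {"user_id": "u42", "context": "some chunk"}},
    {"metadata": {"user_id": "u42", "text": "another chunk"}},
])
```

So the fix is either to store each chunk under the `"text"` metadata key at upsert time, or to pass your existing key as `text_key` when constructing the vector store.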
Thanks