Replies: 1 comment
Answer generated by a 🤖

I understand that you're currently using OpenAI Embeddings and FAISS as a vector store to handle a large set of Q&A pairs for your exam preparation. You've noticed that while the system works well most of the time, there are instances where the retrieval of candidate Q&A pairs is not optimal. You're considering using MultiQueryRetriever or generating more plausible variations of questions to improve this, but you're open to other suggestions.

One potential solution is to experiment with different embeddings or vector stores. For example, you might try TransformerEmbeddings or SentenceTransformersEmbeddings, which may capture the semantic meaning of your Q&A pairs more effectively than the current OpenAI Embeddings. For the vector store, you could consider other options such as Annoy. Different vector stores use different indexing and search algorithms, which could improve the retrieval of relevant documents.

Another aspect to consider is adjusting the chunk_size and chunk_overlap parameters of the CharacterTextSplitter. These affect the granularity of the documents being indexed and, consequently, the retrieval performance.

Please note that these are just suggestions, and their effectiveness can vary depending on your specific use case. The key is to experiment with different configurations to see what works best for you. If you find a solution that works well and is not covered in the existing documentation, we encourage you to open a pull request to update the documentation; this could be very helpful for other users who encounter a similar issue in the future.
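To make the chunk_size/chunk_overlap point concrete, here is a minimal, self-contained sketch in plain Python. It is not LangChain's actual CharacterTextSplitter (which splits on a separator first and then merges); it only illustrates how the two parameters trade off chunk count against overlap:

```python
# Toy character splitter illustrating how chunk_size and chunk_overlap
# control the granularity of indexed documents. This is only a sketch;
# LangChain's CharacterTextSplitter splits on a separator first.
def split_text(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap      # how far the window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Question: This is one of many questions? Answer: Yes."

coarse = split_text(text, chunk_size=40, chunk_overlap=0)   # 2 chunks
fine = split_text(text, chunk_size=20, chunk_overlap=5)     # 4 overlapping chunks
print(len(coarse), len(fine))  # → 2 4
```

Smaller chunks with overlap keep each indexed unit focused (good when one Q&A pair should be one retrievable unit), at the cost of more vectors in the store.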
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
Hi,
I am trying to do Q&A over a large set of questions and answers (not documents like PDF, md, ...). It is like preparing for an exam: I have 2000 previously asked questions, each with a definite, verified answer.
I have used OpenAI Embeddings and FAISS as a vector store to store the embedded Question and Answer pairs. A LangChain Document looks something like this:
Document(page_content='Question: This is one of many questions? Answer: Yes. This is the correct and verified answer on the question.', metadata={'ID': 1, 'source': 'Some great resource.'})
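For readers unfamiliar with the setup, the retrieval step can be illustrated with a self-contained toy: a bag-of-words "embedding" and cosine similarity stand in for the real pipeline, in which each Document is embedded with OpenAI Embeddings and indexed in FAISS. The example documents below are made up for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; the real setup uses OpenAI embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Documents shaped like the LangChain Document above (hypothetical content).
docs = [
    {"page_content": "Question: What is the capital of France? Answer: Paris.",
     "metadata": {"ID": 1, "source": "Some great resource."}},
    {"page_content": "Question: How many legs does a spider have? Answer: Eight.",
     "metadata": {"ID": 2, "source": "Some great resource."}},
]
index = [(embed(d["page_content"]), d) for d in docs]

def similarity_search(query, k=1):
    # Rank stored documents by similarity to the query, like a vector store does.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [d for _, d in ranked[:k]]

best = similarity_search("What is the capital of France?")[0]
print(best["metadata"]["ID"])  # → 1
```

The quality of the whole chain hinges on this ranking step: if the nearest neighbors returned here are off, the QA chain gets the wrong context, which matches the failure mode described below.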
I am using the LangChain Retrieval QA chain with the Microsoft Azure OpenAI API. Technically it all works great, and most of the time the returned answer is good/satisfying. But there are too many misses. When I look at the documents used as context, I can see poor retrieval of candidate Q&A pairs as LangChain Documents from the vector store. What would be the best strategy for improving this? (I believe the problem is the vector store returning non-optimal sources as context.)
I was thinking about using MultiQueryRetriever, but I doubt it is the right way. Perhaps a "reverse" MultiQueryRetriever: for each Question and Answer pair, first generate more plausible variations of the question (say 5) - now you have 6 Q&A pairs with the same answer (the original question, 5 generated by the LLM, and the same answer for all) - embed them, store them in the vector store, and run the Retrieval QA chain...
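That "reverse MultiQueryRetriever" idea can be sketched as follows. generate_variations is a hypothetical stub standing in for an LLM paraphrase call (in the real system it would produce ~5 paraphrases per question, and each variant would be embedded and stored in FAISS alongside the original):

```python
# Sketch of the "reverse MultiQueryRetriever" idea: index several
# phrasings of each question, all mapped to the same verified answer.

def generate_variations(question):
    # Hypothetical stub; a real pipeline would ask an LLM for paraphrases.
    return [
        question,
        question.replace("What is", "Tell me"),
        question.lower(),
    ]

qa_pairs = [("What is the capital of France?", "Paris.")]

# (question variant, answer) pairs; in the real system each variant
# is embedded and indexed, so any phrasing retrieves the same answer.
index = []
for question, answer in qa_pairs:
    for variant in generate_variations(question):
        index.append((variant, answer))

print(len(index))  # → 3
```

The design point is that paraphrasing happens once at indexing time instead of at every query (as MultiQueryRetriever does), which adds storage but keeps query latency and per-query LLM cost unchanged.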
I assume the problem might be better solved in some other way. Any ideas on how to tackle this? Thanks.