I’m currently working on a RAG chatbot using this repository. I’ve already implemented a SharePoint pipeline that captures all changed content daily, updates Blob Storage, and applies the necessary index changes to keep our chatbot up to date. This setup is working really well and is already delivering great results within our company. Thanks for this amazing repo ❤️
Now, we want to scale up with additional data sources. So far, we’ve been working with documents, but we also want to include a new source: CMDB/EAM content. You can think of this as SQL-like data containing tickets created in our company, with information such as ticket ID, title, solution, status, and other related metadata.
I’m looking for advice on how to structure this. My idea is to create one index that includes fields for both the SharePoint and the CMDB data source. Metadata fields specific to the CMDB would be populated when the document comes from the CMDB, while SharePoint fields would remain null, and vice versa.
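To make the single-index idea concrete, here is a rough sketch of the document shape I have in mind. All field names (`source_type`, `sharepoint_path`, `ticket_id`, `ticket_status`) are placeholders I made up for illustration, not fields from this repo, and the `source_type` discriminator is my own assumption about how the two sources could be told apart:

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class IndexDocument:
    # Shared fields, populated for every source.
    id: str
    content: str
    source_type: str  # assumed discriminator: "sharepoint" or "cmdb"
    # SharePoint-only field (hypothetical name); stays None for CMDB entries.
    sharepoint_path: Optional[str] = None
    # CMDB-only fields (hypothetical names); stay None for SharePoint entries.
    ticket_id: Optional[str] = None
    ticket_status: Optional[str] = None


sharepoint_doc = IndexDocument(
    id="sp-1",
    content="First chunk of a SharePoint document...",
    source_type="sharepoint",
    sharepoint_path="/sites/example/handbook.pdf",
)
cmdb_doc = IndexDocument(
    id="cmdb-1",
    content="JobId: 1, Title: Example, Description: Example ticket",
    source_type="cmdb",
    ticket_id="1",
    ticket_status="closed",
)

# Plain dicts, as they might be handed to an indexing step.
docs = [asdict(sharepoint_doc), asdict(cmdb_doc)]
```

With a shape like this, every entry carries the full set of fields, and the fields belonging to the other source are simply null.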
For the content field, SharePoint entries would use the document chunks as usual in this repo, while the CMDB entries would use a concatenated string of all relevant metadata, for example:
"content": "JobId: …, Title: …, Description: …"
This allows the field to be used for embeddings.
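A minimal sketch of how such a content string could be built from a ticket row. The helper name `build_cmdb_content`, the column keys, and the sample ticket are all hypothetical; skipping empty values is my own choice so the embedding text is not padded with blank labels:

```python
def build_cmdb_content(ticket: dict, fields: list[tuple[str, str]]) -> str:
    """Join selected ticket fields into one "Label: value" string for embedding."""
    parts = []
    for key, label in fields:
        value = ticket.get(key)
        # Skip missing or blank values rather than emitting an empty label.
        if value is None or str(value).strip() == "":
            continue
        parts.append(f"{label}: {str(value).strip()}")
    return ", ".join(parts)


# Hypothetical mapping from CMDB column names to the labels used in the text.
CMDB_FIELDS = [
    ("job_id", "JobId"),
    ("title", "Title"),
    ("description", "Description"),
]

# Made-up sample row; "status" is not in CMDB_FIELDS, so it is left out.
ticket = {"job_id": 42, "title": "Printer offline", "description": "", "status": "closed"}
content = build_cmdb_content(ticket, CMDB_FIELDS)
print(content)
```

Keeping the field list explicit makes it easy to decide which metadata ends up in the embedded text and which stays only in filterable fields.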
I plan to use Agentic Search to handle more complex queries, especially for the CMDB content, as users will likely ask multiple questions in a single prompt.
My question is: does this approach seem appropriate? Or is combining different types of data sources in a single index not recommended? Would multi-indexing with a custom orchestration or even multiple chatbots be a better approach? What is considered best practice for a scenario like this?