Documentation improvements: Remove duplicate READMEs, consistent titles #2118
# RAG chat: Data ingestion

The [azure-search-openai-demo](/) project can set up a full RAG chat app on Azure AI Search and OpenAI so that you can chat on custom data, like internal enterprise data or domain-specific knowledge sets. For full instructions on setting up the project, consult the [main README](/README.md), and then return here for detailed instructions on the data ingestion component.

The chat app provides two ways to ingest data: manual indexing and integrated vectorization. This document explains the differences between the two approaches and provides an overview of the manual indexing process.

- [Supported document formats](#supported-document-formats)
- [Manual indexing process](#manual-indexing-process)
- [Chunking](#chunking)
- [Enhancing search functionality with data categorization](#enhancing-search-functionality-with-data-categorization)
- [Indexing additional documents](#indexing-additional-documents)
- [Removing documents](#removing-documents)
- [Integrated Vectorization](#integrated-vectorization)
- [Indexing of additional documents](#indexing-of-additional-documents)
- [Removal of documents](#removal-of-documents)
- [Scheduled indexing](#scheduled-indexing)
- [Debugging tips](#debugging-tips)

## Supported document formats

To ingest a document format, we need a tool that can turn it into text. By default, manual indexing uses Azure Document Intelligence (DI in the table below), but we also have local parsers for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to decrease charges.

| Format | Manual indexing | Integrated Vectorization |
| ------ | ------------------------------------ | ------------------------ |

The Blob indexer used by the Integrated Vectorization approach also supports a few [additional formats](https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-document-formats).

## Manual indexing process

The [`prepdocs.py`](../app/backend/prepdocs.py) script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. You can pass additional arguments directly to the script, for example `scripts/prepdocs.ps1 --removeall`. Whenever `azd up` or `azd provision` is run, the script is called automatically.

The script uses the following steps to index documents:

1. If it doesn't yet exist, create a new index in Azure AI Search.
2. Upload the PDFs to Azure Blob Storage.
3. Split the PDFs into chunks of text.
4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.

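The steps above can be sketched in Python. This is a simplified illustration only: the real implementation lives in `app/backend/prepdocs.py` and `scripts/prepdocslib/`, and every helper below is a hypothetical stand-in, not the project's actual API.

```python
# Toy sketch of the manual indexing pipeline (hypothetical helpers, not the
# project's real code, which lives in prepdocs.py and scripts/prepdocslib/).

def extract_text(path: str) -> str:
    # Stand-in for Azure Document Intelligence or a local parser.
    return f"text extracted from {path}"

def split_into_chunks(text: str, size: int = 10) -> list[str]:
    # Stand-in for the real splitter (see the Chunking section).
    return [text[i:i + size] for i in range(0, len(text), size)]

def index_documents(paths: list[str]) -> list[dict]:
    # Build the search documents that would be uploaded to Azure AI Search;
    # with vectors enabled, each doc would also carry an embedding field.
    docs = []
    for path in paths:
        for chunk in split_into_chunks(extract_text(path)):
            docs.append({"content": chunk, "sourcefile": path})
    return docs

docs = index_documents(["report.pdf"])
```

In the real pipeline, each of these stand-ins corresponds to a configurable component (parser, splitter, embedding service, search uploader).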
### Chunking

We're often asked why we need to break up the PDFs into chunks when Azure AI Search supports searching large documents.

Chunking allows us to limit the amount of information we send to OpenAI due to token limits.

If needed, you can modify the chunking algorithm in `scripts/prepdocslib/textsplitter.py`.

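To illustrate the idea, here is a minimal splitter that produces fixed-size chunks with overlap, so that text straddling a chunk boundary appears in both adjacent chunks. This is a toy version for illustration only; the project's real algorithm in `textsplitter.py` is considerably more sophisticated.

```python
# Toy text splitter: fixed-size character chunks with overlap, so content
# near a boundary is retrievable from either adjacent chunk.

def split_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by less than a full chunk so consecutive chunks overlap.
        start += chunk_size - overlap
    return chunks

chunks = split_text("a" * 1200, chunk_size=500, overlap=100)
```

Overlap trades a little index size for better recall: a sentence split across a boundary still appears whole in at least one chunk.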
### Enhancing search functionality with data categorization

To enhance search functionality, categorize data during the ingestion process with the `--category` argument, for example `scripts/prepdocs.ps1 --category ExampleCategoryName`. This argument specifies the category to which the data belongs, enabling you to filter search results based on these categories.

After running the script with the desired category, ensure these categories are added to the 'Include Category' dropdown list. This can be found in the developer settings of [`Chat.tsx`](../app/frontend/src/pages/chat/Chat.tsx) and [`Ask.tsx`](../app/frontend/src/pages/ask/Ask.tsx). The default option for this dropdown is "All". By including specific categories, you can refine your search results more effectively.

### Indexing additional documents

To upload more PDFs, put them in the `data/` folder and run `./scripts/prepdocs.sh` or `./scripts/prepdocs.ps1`.

### Removing documents

To remove all documents, use `scripts/prepdocs.sh --removeall` or `scripts/prepdocs.ps1 --removeall`.

You can also remove individual documents by using the `--remove` flag. Open either `scripts/prepdocs.sh` or `scripts/prepdocs.ps1` and replace `/data/*` with `/data/YOUR-DOCUMENT-FILENAME-GOES-HERE.pdf`. Then run `scripts/prepdocs.sh --remove` or `scripts/prepdocs.ps1 --remove`.

## Integrated Vectorization

Azure AI Search includes an [integrated vectorization feature](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in/ba-p/3960809#:~:text=Integrated%20vectorization%20is%20a%20new%20feature%20of%20Azure,pull-indexers%2C%20and%20vectorization%20of%20text%20queries%20through%20vectorizers), a cloud-based approach to data ingestion. Integrated vectorization takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.

The Azure AI Search indexer will take care of removing those documents from the index.

### Scheduled indexing

If you would like the indexer to run automatically, you can set it up to [run on a schedule](https://learn.microsoft.com/azure/search/search-howto-schedule-indexers).

## Debugging tips

If you are not sure whether a file was successfully uploaded, you can query the index from the Azure Portal or from the REST API. Open the index and paste the queries below into the search bar.

To see all the filenames uploaded to the index:

```json
{
  "search": "*",
  "count": true,
  "top": 1,
  "facets": ["sourcefile"]
}
```

To search for specific filenames:

```json
{
  "search": "*",
  "count": true,
  "top": 1,
  "filter": "sourcefile eq 'employee_handbook.pdf'"
}
```