Commit bc75292

Move formats again
1 parent 67c126f commit bc75292

1 file changed: +16 -16 lines changed

docs/data_ingestion.md

Lines changed: 16 additions & 16 deletions
@@ -4,8 +4,8 @@ The [azure-search-openai-demo](/) project can set up a full RAG chat app on Azur
 
 The chat app provides two ways to ingest data: manual indexing and integrated vectorization. This document explains the differences between the two approaches and provides an overview of the manual indexing process.
 
+- [Supported document formats](#supported-document-formats)
 - [Manual indexing process](#manual-indexing-process)
-- [Supported document formats](#supported-document-formats)
 - [Chunking](#chunking)
 - [Categorizing data for enhanced search](#enhancing-search-functionality-with-data-categorization)
 - [Indexing additional documents](#indexing-additional-documents)
@@ -16,22 +16,9 @@ The chat app provides two ways to ingest data: manual indexing and integrated ve
 - [Scheduled indexing](#scheduled-indexing)
 - [Debugging tips](#debugging-tips)
 
-## Manual indexing process
-
-The [`prepdocs.py`](../app/backend/prepdocs.py) script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. You can pass additional arguments directly to the script, for example `scripts/prepdocs.ps1 --removeall`. Whenever `azd up` or `azd provision` is run, the script is called automatically.
-
-![Diagram of the indexing process](images/diagram_prepdocs.png)
-
-The script uses the following steps to index documents:
-
-1. If it doesn't yet exist, create a new index in Azure AI Search.
-2. Upload the PDFs to Azure Blob Storage.
-3. Split the PDFs into chunks of text.
-4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
-
-### Supported document formats
+## Supported document formats
 
-In order to ingest a document format, we need a tool that can turn it into text. By default, use Azure Document Intelligence (DI in the table below), but we also have local parsers for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to decrease charges.
+In order to ingest a document format, we need a tool that can turn it into text. By default, the manual indexing uses Azure Document Intelligence (DI in the table below), but we also have local parsers for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to decrease charges.
 
 | Format | Manual indexing | Integrated Vectorization |
 | ------ | ------------------------------------ | ------------------------ |
@@ -45,6 +32,19 @@ In order to ingest a document format, we need a tool that can turn it into text.
 
 The Blob indexer used by the Integrated Vectorization approach also supports a few [additional formats](https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-document-formats).
 
+## Manual indexing process
+
+The [`prepdocs.py`](../app/backend/prepdocs.py) script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. You can pass additional arguments directly to the script, for example `scripts/prepdocs.ps1 --removeall`. Whenever `azd up` or `azd provision` is run, the script is called automatically.
+
+![Diagram of the indexing process](images/diagram_prepdocs.png)
+
+The script uses the following steps to index documents:
+
+1. If it doesn't yet exist, create a new index in Azure AI Search.
+2. Upload the PDFs to Azure Blob Storage.
+3. Split the PDFs into chunks of text.
+4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
+
 ### Chunking
 
 We're often asked why we need to break up the PDFs into chunks when Azure AI Search supports searching large documents.
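
For readers skimming this commit, the four indexing steps in the moved "Manual indexing process" section can be sketched roughly as follows. This is a minimal illustration only, not the actual `prepdocs.py` implementation: `InMemoryIndex`, `split_into_chunks`, `embed`, and `index_documents` are hypothetical names, plain dictionaries stand in for Azure Blob Storage and Azure AI Search, the input is assumed to be already-extracted text rather than PDFs, and the character-window chunking with overlap is an assumed strategy.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for a search index (not the Azure AI Search SDK)."""
    name: str
    documents: list = field(default_factory=list)


def split_into_chunks(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows of at most max_chars that overlap by `overlap`."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if not text:
        return []
    step = max_chars - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + max_chars] for start in range(0, last_start, step)]


def embed(chunk: str) -> list[float]:
    """Placeholder embedding; a real pipeline would call an embedding model here."""
    return [float(len(chunk))]


def index_documents(files, indexes, blob_store, index_name, use_vectors=True):
    """Run the four steps over `files`, a mapping of filename to extracted text."""
    # Step 1: create the index only if it does not exist yet.
    index = indexes.setdefault(index_name, InMemoryIndex(index_name))
    for filename, text in files.items():
        # Step 2: upload the source document (a dict stands in for blob storage).
        blob_store[filename] = text
        # Step 3: split the document text into chunks.
        for i, chunk in enumerate(split_into_chunks(text)):
            doc = {"id": f"{filename}-{i}", "content": chunk, "sourcefile": filename}
            # Step 4: attach an embedding when vectors are enabled, then upload the chunk.
            if use_vectors:
                doc["embedding"] = embed(chunk)
            index.documents.append(doc)
    return index
```

Calling `index_documents({"a.pdf": some_text}, {}, {}, "demo")` returns the populated stand-in index; passing `use_vectors=False` skips the embedding step, mirroring the "if using vectors" condition in step 4.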
