docs/data_ingestion.md (16 additions & 16 deletions)
@@ -4,8 +4,8 @@ The [azure-search-openai-demo](/) project can set up a full RAG chat app on Azur

 The chat app provides two ways to ingest data: manual indexing and integrated vectorization. This document explains the differences between the two approaches and provides an overview of the manual indexing process.
@@ -16,22 +16,9 @@ The chat app provides two ways to ingest data: manual indexing and integrated ve
 - [Scheduled indexing](#scheduled-indexing)
 - [Debugging tips](#debugging-tips)

-## Manual indexing process
-
-The [`prepdocs.py`](../app/backend/prepdocs.py) script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. You can pass additional arguments directly to the script, for example `scripts/prepdocs.ps1 --removeall`. Whenever `azd up` or `azd provision` is run, the script is called automatically.
-
-
-
-The script uses the following steps to index documents:
-
-1. If it doesn't yet exist, create a new index in Azure AI Search.
-2. Upload the PDFs to Azure Blob Storage.
-3. Split the PDFs into chunks of text.
-4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
-
-### Supported document formats
+## Supported document formats

-In order to ingest a document format, we need a tool that can turn it into text. By default, use Azure Document Intelligence (DI in the table below), but we also have local parsers for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to decrease charges.
+In order to ingest a document format, we need a tool that can turn it into text. By default, the manual indexing uses Azure Document Intelligence (DI in the table below), but we also have local parsers for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to decrease charges.

 | Format | Manual indexing | Integrated Vectorization |
@@ -45,6 +32,19 @@ In order to ingest a document format, we need a tool that can turn it into text.

 The Blob indexer used by the Integrated Vectorization approach also supports a few [additional formats](https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-document-formats).

+## Manual indexing process
+
+The [`prepdocs.py`](../app/backend/prepdocs.py) script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. You can pass additional arguments directly to the script, for example `scripts/prepdocs.ps1 --removeall`. Whenever `azd up` or `azd provision` is run, the script is called automatically.
+
+
+
+The script uses the following steps to index documents:
+
+1. If it doesn't yet exist, create a new index in Azure AI Search.
+2. Upload the PDFs to Azure Blob Storage.
+3. Split the PDFs into chunks of text.
+4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
+
 ### Chunking

 We're often asked why we need to break up the PDFs into chunks when Azure AI Search supports searching large documents.
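
For readers of this diff who want a concrete picture of the "Split the PDFs into chunks of text" step, here is a minimal sketch that assumes fixed-size character windows with a small overlap. It is not the splitting logic in `prepdocs.py`; the function name `split_into_chunks` and the parameters `max_chars` and `overlap` are hypothetical and chosen only for illustration.

```python
def split_into_chunks(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper for illustration only. Each chunk after the first
    repeats the last `overlap` characters of the previous chunk, so a
    sentence cut at a boundary still appears whole in one of the two chunks
    when it is shorter than the overlap.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks: list[str] = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("abcdefghij", max_chars=4, overlap=1):
        print(chunk)
```

Each chunk produced this way would then be uploaded to the search index as its own record, optionally with an embedding, as described in step 4 of the list in this diff.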