Commit eb4b4e2 (parent 3395382)

Rework docs for cloud ingestion feature (#2826)

* Add link to cloud ingestion guide
* Update docs
* Fix the links

File tree

2 files changed (+34 −34)


docs/data_ingestion.md

Lines changed: 27 additions & 4 deletions
@@ -14,7 +14,8 @@ The chat app provides two ways to ingest data: manual ingestion and cloud ingestion
 - [Indexing additional documents](#indexing-additional-documents)
 - [Removing documents](#removing-documents)
 - [Cloud ingestion](#cloud-ingestion)
-- [Custom skills pipeline](#custom-skills-pipeline)
+- [Enabling cloud ingestion](#enabling-cloud-ingestion)
+- [Indexer architecture](#indexer-architecture)
 - [Indexing of additional documents](#indexing-of-additional-documents)
 - [Removal of documents](#removal-of-documents)
 - [Scheduled indexing](#scheduled-indexing)
@@ -136,11 +137,33 @@ You can also remove individual documents by using the `--remove` flag. Open either
 
 This project includes an optional feature to perform data ingestion in the cloud using Azure Functions as custom skills for Azure AI Search indexers. This approach offloads the ingestion workload from your local machine to the cloud, allowing for more scalable and efficient processing of large datasets.
 
-You must first explicitly [enable cloud ingestion](./deploy_features.md#enabling-cloud-ingestion) in the `azd` environment to use this feature.
+### Enabling cloud ingestion
 
-This feature cannot be used on an existing index. You need to create a new index, or drop and recreate an existing one. The newly created index schema includes a new field, `parent_id`, which the indexer uses internally to manage the life cycle of chunks.
+1. If you've previously deployed, delete the existing search index or create a new index, since this feature cannot be used on an existing index. The newly created index schema includes a new field, `parent_id`, which the indexer uses internally to manage the life cycle of chunks. Run this command to set a new index name:
 
-### Custom skills pipeline
+```shell
+azd env set AZURE_SEARCH_INDEX cloudindex
+```
+
+2. Run this command:
+
+```shell
+azd env set USE_CLOUD_INGESTION true
+```
+
+3. Open `azure.yaml` and un-comment the document-extractor, figure-processor, and text-processor sections. Those are the Azure Functions apps that will be deployed and serve as Azure AI Search skills.
+
+4. Provision the new Azure Functions resources, deploy the function apps, and update the search indexer with:
+
+```shell
+azd up
+```
+
+5. That will upload the documents in the `data/` folder to the Blob storage container, create the indexer and skillset, and run the indexer to ingest the data. You can monitor the indexer status from the Azure Portal.
+
+6. When you have new documents to ingest, upload them to the Blob storage container and run the indexer from the Azure Portal.
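For step 6, the upload-and-reindex loop can also be driven from the command line instead of the portal. This is a sketch, not part of this commit: the storage account, container, search service, and indexer names are placeholders to replace with your deployment's values, and the run is triggered through the standard Azure AI Search "Run Indexer" REST endpoint:

```shell
# Upload new documents to the Blob container that the indexer watches
# (placeholder names; substitute the values from your azd environment):
az storage blob upload-batch \
  --account-name mystorageaccount \
  --destination content \
  --source ./data

# Trigger an indexer run via the Azure AI Search REST API:
curl -X POST \
  "https://mysearchservice.search.windows.net/indexers/myindexer/run?api-version=2024-07-01" \
  -H "api-key: $AZURE_SEARCH_ADMIN_KEY" \
  -H "Content-Length: 0"
```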
+
+### Indexer architecture
 
 The cloud ingestion pipeline uses four Azure Functions as custom skills within an Azure AI Search indexer. Each function corresponds to a stage in the ingestion process. Here's how it works:
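The custom-skill functions described above all speak the documented Web API skill contract: the indexer POSTs a JSON body with a `values` array, and the skill must return one result per input `recordId`. A minimal sketch of that contract follows; the function name and the fixed-size chunking are illustrative stand-ins, not this repo's actual processing code:

```python
# Sketch of the custom Web API skill contract used by Azure AI Search
# indexers: the request carries a "values" array of records, and the
# response must echo each "recordId" with a "data" payload (plus
# "errors"/"warnings"). The chunking below is a toy stand-in for a
# real processing stage such as text extraction or splitting.

def run_skill(payload: dict) -> dict:
    """Return one result per input record, per the skill contract."""
    results = []
    for record in payload["values"]:
        text = record["data"].get("text", "")
        # Trivial fixed-size chunking, purely for illustration:
        chunks = [text[i:i + 20] for i in range(0, len(text), 20)] or [""]
        results.append({
            "recordId": record["recordId"],  # must be echoed back unchanged
            "data": {"chunks": chunks},
            "errors": None,
            "warnings": None,
        })
    return {"values": results}
```

Each of the four deployed functions would implement this same request/response shape for its own stage of the pipeline.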

docs/deploy_features.md

Lines changed: 7 additions & 30 deletions
@@ -8,6 +8,7 @@ You should typically enable these features before running `azd up`. Once you've
 * [Using different embedding models](#using-different-embedding-models)
 * [Enabling multimodal embeddings and answering](#enabling-multimodal-embeddings-and-answering)
 * [Enabling media description with Azure Content Understanding](#enabling-media-description-with-azure-content-understanding)
+* [Enabling cloud data ingestion](#enabling-cloud-data-ingestion)
 * [Enabling client-side chat history](#enabling-client-side-chat-history)
 * [Enabling persistent chat history with Azure Cosmos DB](#enabling-persistent-chat-history-with-azure-cosmos-db)
 * [Enabling language picker](#enabling-language-picker)
@@ -256,6 +257,12 @@ first [remove the existing documents](./data_ingestion.md#removing-documents) and
 ⚠️ This feature does not yet support DOCX, PPTX, or XLSX formats. If you have figures in those formats, they will be ignored.
 Convert them first to PDF or image formats to enable media description.
 
+## Enabling cloud data ingestion
+
+By default, this project runs a local script to ingest data. Once you move beyond the sample documents, you may want to enable [cloud ingestion](./data_ingestion.md#cloud-ingestion), which uses Azure AI Search indexers and custom Azure AI Search skills based on the same code used by the local ingestion. That approach scales better to larger amounts of data.
+
+Learn more in the [cloud ingestion guide](./data_ingestion.md#cloud-ingestion).
+
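As a rough comparison of the two modes (the local script path below is an assumption about the repo layout, not taken from this commit): local ingestion is a one-shot run on your machine, while cloud ingestion is enabled once and then re-runs inside Azure:

```shell
# Local ingestion: one-shot script run on your machine
# (script name is illustrative):
./scripts/prepdocs.sh

# Cloud ingestion: enable once, then indexers re-run in Azure
azd env set USE_CLOUD_INGESTION true
azd up
```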
 ## Enabling client-side chat history
 
 [📺 Watch: (RAG Deep Dive series) Storing chat history](https://www.youtube.com/watch?v=1YiTFnnLVIA)
@@ -322,36 +329,6 @@ Alternatively you can use the browser's built-in [Speech Synthesis API](https://
 azd env set USE_SPEECH_OUTPUT_BROWSER true
 ```
 
-## Enabling cloud data ingestion
-
-By default, this project runs a local script to ingest data. Once you move beyond the sample documents, you may want cloud ingestion, which uses Azure AI Search indexers and custom Azure AI Search skills based on the same code used by the local ingestion. That approach scales better to larger amounts of data.
-
-To enable cloud ingestion:
-
-1. If you've previously deployed, delete the existing search index or create a new index using:
-
-```shell
-azd env set AZURE_SEARCH_INDEX cloudindex
-```
-
-2. Run this command:
-
-```shell
-azd env set USE_CLOUD_INGESTION true
-```
-
-3. Open `azure.yaml` and un-comment the document-extractor, figure-processor, and text-processor sections. Those are the Azure Functions apps that will be deployed and serve as Azure AI Search skills.
-
-4. Provision the new Azure Functions resources, deploy the function apps, and update the search indexer with:
-
-```shell
-azd up
-```
-
-5. That will upload the documents in the `data/` folder to the Blob storage container, create the indexer and skillset, and run the indexer to ingest the data. You can monitor the indexer status from the portal.
-
-6. When you have new documents to ingest, upload them to the Blob storage container and run the indexer from the Azure Portal.
-
 ## Enabling authentication
 
 By default, the deployed Azure web app will have no authentication or access restrictions enabled, meaning anyone with routable network access to the web app can chat with your indexed data. If you'd like to automatically set up authentication and user login as part of the `azd up` process, see [this guide](./login_and_acl.md).
