Commit eb4b4e2 (parent 3395382)

Rework docs for cloud ingestion feature (#2826)

* Add link to cloud ingestion guide
* Update docs
* Fix the links

File tree

2 files changed (+34 −34)


docs/data_ingestion.md

Lines changed: 27 additions & 4 deletions
@@ -14,7 +14,8 @@ The chat app provides two ways to ingest data: manual ingestion and cloud ingestion
 - [Indexing additional documents](#indexing-additional-documents)
 - [Removing documents](#removing-documents)
 - [Cloud ingestion](#cloud-ingestion)
-- [Custom skills pipeline](#custom-skills-pipeline)
+- [Enabling cloud ingestion](#enabling-cloud-ingestion)
+- [Indexer architecture](#indexer-architecture)
 - [Indexing of additional documents](#indexing-of-additional-documents)
 - [Removal of documents](#removal-of-documents)
 - [Scheduled indexing](#scheduled-indexing)
@@ -136,11 +137,33 @@ You can also remove individual documents by using the `--remove` flag. Open either
 
 This project includes an optional feature to perform data ingestion in the cloud using Azure Functions as custom skills for Azure AI Search indexers. This approach offloads the ingestion workload from your local machine to the cloud, allowing for more scalable and efficient processing of large datasets.
 
-You must first explicitly [enable cloud ingestion](./deploy_features.md#enabling-cloud-ingestion) in the `azd` environment to use this feature.
+### Enabling cloud ingestion
 
-This feature cannot be used on an existing index. You need to create a new index, or drop and recreate an existing one. The newly created index schema includes a new field, `parent_id`, which the indexer uses internally to manage the life cycle of chunks.
+1. If you've previously deployed, delete the existing search index or create a new index, since this feature cannot be used on an existing index. The newly created index schema includes a new field, `parent_id`, which the indexer uses internally to manage the life cycle of chunks. Run this command to set a new index name:
 
-### Custom skills pipeline
+```shell
+azd env set AZURE_SEARCH_INDEX cloudindex
+```
+
+2. Run this command:
+
+```shell
+azd env set USE_CLOUD_INGESTION true
+```
+
+3. Open `azure.yaml` and un-comment the document-extractor, figure-processor, and text-processor sections. Those are the Azure Functions apps that will be deployed and serve as Azure AI Search skills.
+
+4. Provision the new Azure Functions resources, deploy the function apps, and update the search indexer with:
+
+```shell
+azd up
+```
+
+5. That will upload the documents in the `data/` folder to the Blob storage container, create the indexer and skillset, and run the indexer to ingest the data. You can monitor the indexer status from the Azure Portal.
+
+6. When you have new documents to ingest, upload them to the Blob storage container and run the indexer from the Azure Portal.
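For step 6, the upload-and-reindex loop can also be driven from the command line instead of the portal. This is a sketch, not part of this commit: the storage account, container, search service, and indexer names are placeholders to replace with your deployment's values, and the run is triggered through the standard Azure AI Search "Run Indexer" REST endpoint:

```shell
# Upload new documents to the Blob container that the indexer watches
# (placeholder names; substitute the values from your azd environment):
az storage blob upload-batch \
  --account-name mystorageaccount \
  --destination content \
  --source ./data

# Trigger an indexer run via the Azure AI Search REST API:
curl -X POST \
  "https://mysearchservice.search.windows.net/indexers/myindexer/run?api-version=2024-07-01" \
  -H "api-key: $AZURE_SEARCH_ADMIN_KEY" \
  -H "Content-Length: 0"
```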
+
+### Indexer architecture
 
 The cloud ingestion pipeline uses four Azure Functions as custom skills within an Azure AI Search indexer. Each function corresponds to a stage in the ingestion process. Here's how it works:
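The custom-skill functions described above all speak the documented Web API skill contract: the indexer POSTs a JSON body with a `values` array, and the skill must return one result per input `recordId`. A minimal sketch of that contract follows; the function name and the fixed-size chunking are illustrative stand-ins, not this repo's actual processing code:

```python
# Sketch of the custom Web API skill contract used by Azure AI Search
# indexers: the request carries a "values" array of records, and the
# response must echo each "recordId" with a "data" payload (plus
# "errors"/"warnings"). The chunking below is a toy stand-in for a
# real processing stage such as text extraction or splitting.

def run_skill(payload: dict) -> dict:
    """Return one result per input record, per the skill contract."""
    results = []
    for record in payload["values"]:
        text = record["data"].get("text", "")
        # Trivial fixed-size chunking, purely for illustration:
        chunks = [text[i:i + 20] for i in range(0, len(text), 20)] or [""]
        results.append({
            "recordId": record["recordId"],  # must be echoed back unchanged
            "data": {"chunks": chunks},
            "errors": None,
            "warnings": None,
        })
    return {"values": results}
```

Each of the four deployed functions would implement this same request/response shape for its own stage of the pipeline.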

docs/deploy_features.md

Lines changed: 7 additions & 30 deletions
@@ -8,6 +8,7 @@ You should typically enable these features before running `azd up`. Once you've
 * [Using different embedding models](#using-different-embedding-models)
 * [Enabling multimodal embeddings and answering](#enabling-multimodal-embeddings-and-answering)
 * [Enabling media description with Azure Content Understanding](#enabling-media-description-with-azure-content-understanding)
+* [Enabling cloud data ingestion](#enabling-cloud-data-ingestion)
 * [Enabling client-side chat history](#enabling-client-side-chat-history)
 * [Enabling persistent chat history with Azure Cosmos DB](#enabling-persistent-chat-history-with-azure-cosmos-db)
 * [Enabling language picker](#enabling-language-picker)
@@ -256,6 +257,12 @@ first [remove the existing documents](./data_ingestion.md#removing-documents) and
 ⚠️ This feature does not yet support DOCX, PPTX, or XLSX formats. If you have figures in those formats, they will be ignored.
 Convert them first to PDF or image formats to enable media description.
 
+## Enabling cloud data ingestion
+
+By default, this project runs a local script to ingest data. Once you move beyond the sample documents, you may want to enable [cloud ingestion](./data_ingestion.md#cloud-ingestion), which uses Azure AI Search indexers and custom Azure AI Search skills based on the same code used by the local ingestion. That approach scales better to larger amounts of data.
+
+Learn more in the [cloud ingestion guide](./data_ingestion.md#cloud-ingestion).
+
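As a rough comparison of the two modes (the local script path below is an assumption about the repo layout, not taken from this commit): local ingestion is a one-shot run on your machine, while cloud ingestion is enabled once and then re-runs inside Azure:

```shell
# Local ingestion: one-shot script run on your machine
# (script name is illustrative):
./scripts/prepdocs.sh

# Cloud ingestion: enable once, then indexers re-run in Azure
azd env set USE_CLOUD_INGESTION true
azd up
```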
 ## Enabling client-side chat history
 
 [📺 Watch: (RAG Deep Dive series) Storing chat history](https://www.youtube.com/watch?v=1YiTFnnLVIA)
@@ -322,36 +329,6 @@ Alternatively you can use the browser's built-in [Speech Synthesis API](https://
 azd env set USE_SPEECH_OUTPUT_BROWSER true
 ```
 
-## Enabling cloud data ingestion
-
-By default, this project runs a local script to ingest data. Once you move beyond the sample documents, you may want cloud ingestion, which uses Azure AI Search indexers and custom Azure AI Search skills based on the same code used by the local ingestion. That approach scales better to larger amounts of data.
-
-To enable cloud ingestion:
-
-1. If you've previously deployed, delete the existing search index or create a new index using:
-
-```shell
-azd env set AZURE_SEARCH_INDEX cloudindex
-```
-
-2. Run this command:
-
-```shell
-azd env set USE_CLOUD_INGESTION true
-```
-
-3. Open `azure.yaml` and un-comment the document-extractor, figure-processor, and text-processor sections. Those are the Azure Functions apps that will be deployed and serve as Azure AI Search skills.
-
-4. Provision the new Azure Functions resources, deploy the function apps, and update the search indexer with:
-
-```shell
-azd up
-```
-
-5. That will upload the documents in the `data/` folder to the Blob storage container, create the indexer and skillset, and run the indexer to ingest the data. You can monitor the indexer status from the portal.
-
-6. When you have new documents to ingest, upload them to the Blob storage container and run the indexer from the Azure Portal.
-
 ## Enabling authentication
 
 By default, the deployed Azure web app will have no authentication or access restrictions enabled, meaning anyone with routable network access to the web app can chat with your indexed data. If you'd like to automatically set up authentication and user login as part of the `azd up` process, see [this guide](./login_and_acl.md).
