
Commit cb42ea0

Merge pull request #224166 from HeidiSteen/heidist-work
Revise index large data
2 parents d0a8ea1 + acb37a4 commit cb42ea0

1 file changed: 42 additions, 23 deletions

articles/search/search-howto-large-index.md

```diff
@@ -1,40 +1,39 @@
 ---
-title: Index large data set using built-in indexers
+title: Index large data sets for full text search
 titleSuffix: Azure Cognitive Search
-description: Strategies for large data indexing or computationally intensive indexing through batch mode, resourcing, and techniques for scheduled, parallel, and distributed indexing.
+description: Strategies for large data indexing or computationally intensive indexing through batch mode, resourcing, and scheduled, parallel, and distributed indexing.
 
 manager: nitinme
 author: HeidiSteen
 ms.author: heidist
 ms.service: cognitive-search
 ms.topic: conceptual
-ms.date: 12/10/2022
+ms.date: 01/17/2023
 ---
 
 # Index large data sets in Azure Cognitive Search
 
-Azure Cognitive Search supports [two basic approaches](search-what-is-data-import.md) for importing data into a search index. You can *push* your data into the index programmatically, or point an [Azure Cognitive Search indexer](search-indexer-overview.md) at a supported data source to *pull* in the data.
+If your search solution requirements include indexing big data or complex data, this article describes the strategies for accommodating long-running processes on Azure Cognitive Search.
 
-As data volumes grow or processing needs change, you might find that simple indexing strategies are no longer practical. For Azure Cognitive Search, there are several approaches for accommodating larger data sets, ranging from how you structure a data upload request, to using a source-specific indexer for scheduled and distributed workloads.
-
-The same techniques also apply to long-running processes. In particular, the steps outlined in [parallel indexing](#run-indexers-in-parallel) are helpful for computationally intensive indexing, such as image analysis or natural language processing in an [AI enrichment pipeline](cognitive-search-concept-intro.md).
+This article assumes familiarity with the [two basic approaches for importing data](search-what-is-data-import.md): pushing data into an index, or pulling in data from a supported data source using a [search indexer](search-indexer-overview.md). The strategy you choose is determined by the indexing approach you're already using. If your scenario involves computationally intensive [AI enrichment](cognitive-search-concept-intro.md), then your strategy must include indexers, given the skillset dependency on indexers.
 
-The following sections explain techniques for indexing large amounts of data for both push and pull approaches. You should also review [Tips for improving performance](search-performance-tips.md) for more best practices.
+This article complements [Tips for better performance](search-performance-tips.md), which offers best practices on index and query design. A well-designed index that includes only the fields and attributes you need is an important prerequisite for large-scale indexing.
 
-For C# tutorials, code samples, and alternative strategies, see:
+> [!NOTE]
+> The strategies described in this article assume a single large data source. If your solution requires indexing from multiple data sources, see [Index multiple data sources in Azure Cognitive Search](https://github.com/Azure-Samples/azure-cognitive-search-multiple-containers-indexer/blob/main/README.md) for a recommended approach.
 
-+ [Tutorial: Optimize indexing speeds](tutorial-optimize-indexing-push-api.md)
-+ [Tutorial: Index at scale using SynapseML and Apache Spark](search-synapseml-cognitive-services.md)
+## Index large data using the push APIs
 
-## Indexing large datasets with the "push" API
+"Push" APIs, such as the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method (Azure SDK for .NET)](/dotnet/api/azure.search.documents.searchclient.indexdocuments), are the most prevalent form of indexing in Cognitive Search. For solutions that use a push API, the strategy for long-running indexing will have one or both of the following components:
 
-When pushing large data volumes into an index using the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method (Azure SDK for .NET)](/dotnet/api/azure.search.documents.searchclient.indexdocuments), batching documents and managing threads are two techniques that improve indexing speed.
++ Batching documents
++ Managing threads
 
 ### Batch multiple documents per request
 
-One of the simplest mechanisms for indexing a larger data set is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1000 documents in a bulk upload operation. These limits apply whether you're using the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method](/dotnet/api/azure.search.documents.searchclient.indexdocuments) in the .NET SDK. For either API, you would package 1000 documents in the body of each request.
+A simple mechanism for indexing a large quantity of data is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1000 documents in a bulk upload operation. These limits apply whether you're using the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method](/dotnet/api/azure.search.documents.searchclient.indexdocuments) in the .NET SDK. For either API, you would package 1000 documents in the body of each request.
 
-Using batches to index documents will significantly improve indexing performance. Determining the optimal batch size for your data is a key component of optimizing indexing speeds. The two primary factors influencing the optimal batch size are:
+Batching documents significantly shortens the time it takes to work through a large data volume. Determining the optimal batch size for your data is a key component of optimizing indexing speeds. The two primary factors influencing the optimal batch size are:
 
 + The schema of your index
 + The size of your data
```
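
To ground the batching guidance in code, here's a minimal C# sketch using the Azure.Search.Documents SDK. The service endpoint, API key, index name, and the `documents` collection are hypothetical placeholders, and the sketch assumes each 1000-document chunk stays under the 16 MB payload limit:

```csharp
using System;
using System.Linq;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Hypothetical service details; replace with your own.
var client = new SearchClient(
    new Uri("https://<your-service>.search.windows.net"),
    "<your-index>",
    new AzureKeyCredential("<admin-api-key>"));

// 'documents' is a hypothetical IEnumerable<SearchDocument> built elsewhere.
// Package up to 1000 documents per request, keeping each payload under 16 MB.
foreach (SearchDocument[] chunk in documents.Chunk(1000))
{
    IndexDocumentsBatch<SearchDocument> batch = IndexDocumentsBatch.Upload(chunk);
    IndexDocumentsResult result = await client.IndexDocumentsAsync(batch);
    Console.WriteLine($"Submitted {result.Results.Count} documents");
}
```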
```diff
@@ -57,15 +56,17 @@ Indexers have built-in thread management, but when you're using the push APIs, y
 
 The Azure .NET SDK automatically retries 503s and other failed requests, but you'll need to implement your own logic to retry 207s. Open-source tools such as [Polly](https://github.com/App-vNext/Polly) can also be used to implement a retry strategy.
 
```

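To illustrate that retry guidance, here's a hedged C# sketch that wraps a [Polly](https://github.com/App-vNext/Polly) backoff policy around a batch upload and inspects per-document results for the partial-success (207) case. The `client` and `batch` variables are assumed from the earlier batching sketch, and the resubmission step is only outlined:

```csharp
using System;
using System.Linq;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;
using Polly;

// Exponential backoff for throttling. The Azure SDK already retries 503s on
// its own, so this outer policy is an extra safety net for illustration.
var retryPolicy = Policy
    .Handle<RequestFailedException>(ex => ex.Status == 503)
    .WaitAndRetryAsync(5, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

await retryPolicy.ExecuteAsync(async () =>
{
    // ThrowOnAnyError = false surfaces partial (207) failures in the result
    // instead of raising an exception.
    Response<IndexDocumentsResult> response = await client.IndexDocumentsAsync(
        batch, new IndexDocumentsOptions { ThrowOnAnyError = false });

    // Collect the keys of documents that failed, then requeue just those.
    var failedKeys = response.Value.Results
        .Where(r => !r.Succeeded)
        .Select(r => r.Key)
        .ToList();
    // ...resubmit the documents matching failedKeys (omitted here)...
});
```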
```diff
-## Indexing large datasets with indexers and the "pull" APIs
+## Index with indexers and the "pull" APIs
 
-[Indexers](search-indexer-overview.md) have built-in capabilities that are particularly useful for accommodating larger data sets:
+[Indexers](search-indexer-overview.md) have several capabilities that are useful for long-running processes:
 
-+ Indexer schedules allow you to parcel out indexing at regular intervals so that you can spread it out over time.
++ Batching documents
++ Parallel indexing over partitioned data
++ Scheduling and change detection for indexing just new and changed documents over time
 
-+ Scheduled indexing can resume at the last known stopping point. If a data source isn't fully scanned within the processing window, the indexer picks up wherever it left off at the last job.
+Indexer schedules can resume processing at the last known stopping point. If data isn't fully indexed within the processing window, the indexer picks up wherever it left off on the next run, assuming you're using a data source that provides change detection.
 
-+ Partitioning data into smaller individual data sources enables parallel processing. You can break up source data into smaller components, such as into multiple containers in Azure Blob Storage, create a [data source](/rest/api/searchservice/create-data-source) for each partition, and then run multiple indexers in parallel.
+Partitioning data into smaller individual data sources enables parallel processing. You can break up source data, such as into multiple containers in Azure Blob Storage, create a [data source](/rest/api/searchservice/create-data-source) for each partition, and then [run the indexers in parallel](search-howto-run-reset-indexers.md), subject to the number of search units of your search service.
```

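As a sketch of that partitioning pattern, the following C# creates one blob data source per container and a matching indexer for each, all targeting the same index. The container names, storage connection string, and index name are hypothetical:

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var indexerClient = new SearchIndexerClient(
    new Uri("https://<your-service>.search.windows.net"),
    new AzureKeyCredential("<admin-api-key>"));

// One data source per blob container partition (names are hypothetical).
foreach (var container in new[] { "partition-1", "partition-2", "partition-3" })
{
    var dataSource = new SearchIndexerDataSourceConnection(
        name: $"blob-ds-{container}",
        type: SearchIndexerDataSourceType.AzureBlob,
        connectionString: "<storage-connection-string>",
        container: new SearchIndexerDataContainer(container));
    await indexerClient.CreateOrUpdateDataSourceConnectionAsync(dataSource);

    // Each indexer reads one partition but writes to the same target index.
    var indexer = new SearchIndexer(
        name: $"blob-indexer-{container}",
        dataSourceName: dataSource.Name,
        targetIndexName: "my-large-index");
    await indexerClient.CreateOrUpdateIndexerAsync(indexer);
}
```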
```diff
 ### Check indexer batch size
 
@@ -92,7 +93,7 @@ When there are no longer any new or updated documents in the data source, indexe
 For more information about setting schedules, see [Create Indexer REST API](/rest/api/searchservice/Create-Indexer) or see [How to schedule indexers for Azure Cognitive Search](search-howto-schedule-indexers.md).
 
 > [!NOTE]
-> Some indexers that run on an older runtime architecture have a 24-hour rather than 2-hour maximum processing window. The 2-hour limit is for newer content processors that run in an [internally managed multi-tenant environment](search-indexer-securing-resources.md#indexer-execution-environment). Whenever possible, Azure Cognitive Search tries to offload indexer and skillset processing to the multi-tenant environment. If the indexer can't be migrated, it will run in the private environment and it can run for as long as 24 hours. If you're scheduling an indexer that fits these characteristics, assume a 24 hour processing window.
+> Some indexers that run on an older runtime architecture have a 24-hour rather than 2-hour maximum processing window. The 2-hour limit is for newer content processors that run in an [internally managed multi-tenant environment](search-indexer-securing-resources.md#indexer-execution-environment). Whenever possible, Azure Cognitive Search tries to offload indexer and skillset processing to the multi-tenant environment. If the indexer can't be migrated, it will run in the private environment and it can run for as long as 24 hours. If you're scheduling an indexer that exhibits these characteristics, assume a 24-hour processing window.
 
 <a name="parallel-indexing"></a>
 
```

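For illustration, here's a minimal sketch of attaching a schedule with the .NET SDK. The indexer name is hypothetical, and `indexerClient` is assumed from the earlier sketch:

```csharp
using System;
using Azure.Search.Documents.Indexes.Models;

// Fetch an existing indexer (hypothetical name).
SearchIndexer indexer = await indexerClient.GetIndexerAsync("blob-indexer-partition-1");

// Run every two hours; each scheduled run resumes at the last stopping point
// when the data source supports change detection.
indexer.Schedule = new IndexingSchedule(TimeSpan.FromHours(2));

await indexerClient.CreateOrUpdateIndexerAsync(indexer);
```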
```diff
@@ -114,7 +115,7 @@ If your data source is an [Azure Blob Storage container](../storage/blobs/storag
 
 1. Specify the same target search index in each indexer.
 
-1. Schedule the indexers.
+1. Schedule the indexers.
 
 1. Review indexer status and execution history for confirmation.
 
```

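As a companion to the steps above, this sketch kicks off the partitioned indexers on demand and then reviews each one's status and last execution result. The indexer names are hypothetical, and `indexerClient` is assumed from the earlier sketches:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Azure.Search.Documents.Indexes.Models;

string[] indexerNames =
{
    "blob-indexer-partition-1",
    "blob-indexer-partition-2",
    "blob-indexer-partition-3"
};

// Kick off all indexers; they run concurrently, subject to available search units.
await Task.WhenAll(indexerNames.Select(name => indexerClient.RunIndexerAsync(name)));

// Review indexer status and execution history for confirmation.
foreach (string name in indexerNames)
{
    SearchIndexerStatus status = await indexerClient.GetIndexerStatusAsync(name);
    Console.WriteLine($"{name}: {status.LastResult?.Status} ({status.LastResult?.ItemCount} items)");
}
```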
```diff
@@ -124,9 +125,27 @@ Second, Azure Cognitive Search doesn't lock the index for updates. Concurrent wr
 
 Although multiple indexer-data-source sets can target the same index, be careful of indexer runs that can overwrite existing values in the index. If a second indexer-data-source targets the same documents and fields, any values from the first run will be overwritten. Field values are replaced in full; an indexer can't merge values from multiple runs into the same field.
 
+## Index big data on Spark
+
+If you have a big data architecture and your data is on a Spark cluster, we recommend [SynapseML for loading and indexing data](search-synapseml-cognitive-services.md). The tutorial includes steps for calling Cognitive Services for AI enrichment, but you can also use the AzureSearchWriter API for text indexing.
+
 ## See also
 
 + [Tips for improving performance](search-performance-tips.md)
 + [Performance analysis](search-performance-analysis.md)
 + [Indexer overview](search-indexer-overview.md)
-+ [Monitor indexer status](search-howto-monitor-indexers.md)
++ [Monitor indexer status](search-howto-monitor-indexers.md)
+
+
+<!-- Azure Cognitive Search supports [two basic approaches](search-what-is-data-import.md) for importing data into a search index. You can *push* your data into the index programmatically, or point an [Azure Cognitive Search indexer](search-indexer-overview.md) at a supported data source to *pull* in the data.
+
+As data volumes grow or processing needs change, you might find that simple indexing strategies are no longer practical. For Azure Cognitive Search, there are several approaches for accommodating larger data sets, ranging from how you structure a data upload request, to using a source-specific indexer for scheduled and distributed workloads.
+
+The same techniques also apply to long-running processes. In particular, the steps outlined in [parallel indexing](#run-indexers-in-parallel) are helpful for computationally intensive indexing, such as image analysis or natural language processing in an [AI enrichment pipeline](cognitive-search-concept-intro.md).
+
+The following sections explain techniques for indexing large amounts of data for both push and pull approaches. You should also review [Tips for improving performance](search-performance-tips.md) for more best practices.
+
+For C# tutorials, code samples, and alternative strategies, see:
+
++ [Tutorial: Optimize indexing workloads](tutorial-optimize-indexing-push-api.md)
++ [Tutorial: Index at scale using SynapseML and Apache Spark](search-synapseml-cognitive-services.md) -->
```
