
Commit b66026f

Refresh articles
1 parent e5cbc4f commit b66026f

5 files changed: +113 -109 lines changed


.openpublishing.redirection.json

Lines changed: 5 additions & 0 deletions
@@ -19,6 +19,11 @@
 "source_path_from_root": "/articles/search/search-howto-index-csv-blobs.md",
 "redirect_url": "/azure/search/search-how-to-index-csv-blobs",
 "redirect_document_id": false
+},
+{
+"source_path_from_root": "/articles/search/search-howto-large-index.md",
+"redirect_url": "/azure/search/search-how-to-large-index",
+"redirect_document_id": false
 }
 ]
 }
Lines changed: 30 additions & 30 deletions
@@ -1,7 +1,7 @@
 ---
 title: Index large data sets for full text search
 titleSuffix: Azure AI Search
-description: Strategies for large data indexing or computationally intensive indexing through batch mode, resourcing, and scheduled, parallel, and distributed indexing.
+description: Learn about strategies for large data indexing or computationally intensive indexing through batch mode, resourcing, and scheduled, parallel, and distributed indexing.
 
 manager: nitinme
 author: HeidiSteen
@@ -10,14 +10,14 @@ ms.service: azure-ai-search
 ms.custom:
 - ignite-2023
 ms.topic: conceptual
-ms.date: 04/01/2024
+ms.date: 10/24/2024
 ---
 
 # Index large data sets in Azure AI Search
 
 If your search solution requirements include indexing big data or complex data, this article articulates strategies for accommodating long running processes on Azure AI Search.
 
-This article assumes familiarity with the [two basic approaches for importing data](search-what-is-data-import.md): pushing data into an index, or pulling in data from a supported data source using a [search indexer](search-indexer-overview.md). If your scenario involves computationally intensive [AI enrichment](cognitive-search-concept-intro.md), then indexers are required, given the skillset dependency on indexers.
+This article assumes familiarity with the [two basic approaches for importing data](search-what-is-data-import.md): *pushing* data into an index, or *pulling* in data from a supported data source using a [search indexer](search-indexer-overview.md). If your scenario involves computationally intensive [AI enrichment](cognitive-search-concept-intro.md), then indexers are required, given the skillset dependency on indexers.
 
 This article complements [Tips for better performance](search-performance-tips.md), which offers best practices on index and query design. A well-designed index that includes only the fields and attributes you need is an important prerequisite for large-scale indexing.
 
@@ -28,45 +28,45 @@ We recommend using a newer search service, created after April 3, 2024, for [hig
 
 ## Index large data using the push APIs
 
-"Push" APIs, such as [Documents Index REST API](/rest/api/searchservice/documents) or the [IndexDocuments method (Azure SDK for .NET)](/dotnet/api/azure.search.documents.searchclient.indexdocuments), are the most prevalent form of indexing in Azure AI Search. For solutions that use a push API, the strategy for long-running indexing will have one or both of the following components:
+*Push* APIs, such as [Documents Index REST API](/rest/api/searchservice/documents) or the [IndexDocuments method (Azure SDK for .NET)](/dotnet/api/azure.search.documents.searchclient.indexdocuments), are the most prevalent form of indexing in Azure AI Search. For solutions that use a push API, the strategy for long-running indexing has one or both of the following components:
 
 + Batching documents
 + Managing threads
 
 ### Batch multiple documents per request
 
-A simple mechanism for indexing a large quantity of data is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1000 documents in a bulk upload operation. These limits apply whether you're using the [Documents Index REST API](/rest/api/searchservice/documents) or the [IndexDocuments method](/dotnet/api/azure.search.documents.searchclient.indexdocuments) in the .NET SDK. For either API, you would package 1000 documents in the body of each request.
+A simple mechanism for indexing a large quantity of data is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1,000 documents in a bulk upload operation. These limits apply whether you're using the [Documents Index REST API](/rest/api/searchservice/documents) or the [IndexDocuments method](/dotnet/api/azure.search.documents.searchclient.indexdocuments) in the .NET SDK. Using either API, you can package 1,000 documents in the body of each request.
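As an illustrative aside (not part of the diffed article), a minimal sketch of batched uploads, assuming the `azure-search-documents` Python SDK; the endpoint, index name, key, and the `load_documents` helper are placeholders:

```python
# Illustrative sketch only: push documents in batches of up to 1,000 per request.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",  # placeholder
    index_name="<index-name>",                             # placeholder
    credential=AzureKeyCredential("<admin-api-key>"),      # placeholder
)

documents = load_documents()  # hypothetical helper that returns a list of dicts

BATCH_SIZE = 1000  # keep each request at or under 1,000 documents and 16 MB
for start in range(0, len(documents), BATCH_SIZE):
    batch = documents[start : start + BATCH_SIZE]
    results = search_client.upload_documents(documents=batch)
    failed = [r.key for r in results if not r.succeeded]
    if failed:
        print(f"{len(failed)} documents failed in the batch starting at {start}")
```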
 
-Batching documents will significantly shorten the amount of time it takes to work through a large data volume. Determining the optimal batch size for your data is a key component of optimizing indexing speeds. The two primary factors influencing the optimal batch size are:
+Batching documents significantly shortens the amount of time it takes to work through a large data volume. Determining the optimal batch size for your data is a key component of optimizing indexing speeds. The two primary factors influencing the optimal batch size are:
 
 + The schema of your index
 + The size of your data
 
-Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine which one results in the fastest indexing speeds for your scenario. [Tutorial: Optimize indexing with the push API](tutorial-optimize-indexing-push-api.md) provides sample code for testing batch sizes using the .NET SDK.
+Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine which one results in the fastest indexing speeds for your scenario. For sample code to test batch sizes using the .NET SDK, see [Tutorial: Optimize indexing with the push API](tutorial-optimize-indexing-push-api.md).
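A rough Python sketch of that kind of test, reusing the placeholder `search_client` and `documents` from the earlier sketch (the linked .NET tutorial is the article's own sample):

```python
# Illustrative sketch only: time a few candidate batch sizes to see which indexes fastest.
import time

for batch_size in (100, 500, 1000):
    sample = documents[:batch_size]  # assumes `documents` and `search_client` from the sketch above
    started = time.perf_counter()
    search_client.upload_documents(documents=sample)
    elapsed = time.perf_counter() - started
    # In practice, average several runs over representative documents.
    print(f"batch size {batch_size}: {elapsed:.2f}s total, {1000 * elapsed / batch_size:.1f} ms/doc")
```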
 
-### Add threads and a retry strategy
+### Manage threads and a retry strategy
 
-Indexers have built-in thread management, but when you're using the push APIs, your application code will have to manage threads. Make sure there are sufficient threads to make full use of the available capacity, especially if you've recently increased partitions or have upgraded to a higher tier search service.
+Indexers have built-in thread management, but when you're using the push APIs, your application code needs to manage threads. Make sure there are sufficient threads to make full use of the available capacity, especially if you've recently increased partitions or upgraded to a higher tier search service.
 
 1. [Increase the number of concurrent threads](tutorial-optimize-indexing-push-api.md#use-multiple-threadsworkers) in your client code.
 
 1. As you ramp up the requests hitting the search service, you might encounter [HTTP status codes](/rest/api/searchservice/http-status-codes) indicating the request didn't fully succeed. During indexing, two common HTTP status codes are:
 
-+ **503 Service Unavailable** - This error means that the system is under heavy load and your request can't be processed at this time.
++ **503 Service Unavailable**: This error means that the system is under heavy load and your request can't be processed at this time.
 
-+ **207 Multi-Status** - This error means that some documents succeeded, but at least one failed.
++ **207 Multi-Status**: This error means that some documents succeeded, but at least one failed.
 
 1. To handle failures, requests should be retried using an [exponential backoff retry strategy](/dotnet/architecture/microservices/implement-resilient-applications/implement-retries-exponential-backoff).
 
-The Azure .NET SDK automatically retries 503s and other failed requests, but you'll need to implement your own logic to retry 207s. Open-source tools such as [Polly](https://github.com/App-vNext/Polly) can also be used to implement a retry strategy.
+The Azure .NET SDK automatically retries 503s and other failed requests, but you need to implement your own logic to retry 207s. Open-source tools such as [Polly](https://github.com/App-vNext/Polly) can also be used to implement a retry strategy.
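Pulling the steps above together, a hedged Python sketch of worker threads plus exponential backoff for documents that didn't succeed (the 207 case). It reuses `search_client` and `documents` from the earlier sketches and assumes, for illustration, that the key field is named `id`:

```python
# Illustrative sketch only: parallel batch uploads with retry for documents that didn't succeed.
import random
import time
from concurrent.futures import ThreadPoolExecutor

from azure.core.exceptions import HttpResponseError

def upload_with_retry(batch, max_attempts=5):
    pending = batch
    for attempt in range(max_attempts):
        try:
            results = search_client.upload_documents(documents=pending)
        except HttpResponseError:
            results = []  # the whole request failed; retry everything after a delay
        failed_keys = {r.key for r in results if not r.succeeded}
        if results and not failed_keys:
            return  # every document in this batch succeeded
        if failed_keys:
            pending = [d for d in pending if d["id"] in failed_keys]  # assumes "id" is the key field
        time.sleep((2 ** attempt) + random.random())  # exponential backoff with jitter
    raise RuntimeError(f"{len(pending)} documents still failing after {max_attempts} attempts")

batches = [documents[i : i + 1000] for i in range(0, len(documents), 1000)]
with ThreadPoolExecutor(max_workers=8) as pool:  # tune the worker count to your service capacity
    futures = [pool.submit(upload_with_retry, batch) for batch in batches]
    for future in futures:
        future.result()  # surface any batch that exhausted its retries
```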
 
-## Index with indexers and the "pull" APIs
+## Index with indexers and the pull APIs
 
 [Indexers](search-indexer-overview.md) have several capabilities that are useful for long-running processes:
 
 + Batching documents
 + Parallel indexing over partitioned data
-+ Scheduling and change detection for indexing just new and change documents over time
++ Scheduling and change detection for indexing only new and changed documents over time
 
 Indexer schedules can resume processing at the last known stopping point. If data isn't fully indexed within the processing window, the indexer picks up wherever it left off on the next run, assuming you're using a data source that provides change detection.
 
@@ -76,13 +76,13 @@ Partitioning data into smaller individual data sources enables parallel processi
 
 As with the push API, indexers allow you to configure the number of items per batch. For indexers based on the [Create Indexer REST API](/rest/api/searchservice/indexers/create), you can set the `batchSize` argument to customize this setting to better match the characteristics of your data.
 
-Default batch sizes are data source specific. Azure SQL Database and Azure Cosmos DB have a default batch size of 1000. In contrast, Azure Blob indexing sets batch size at 10 documents in recognition of the larger average document size.
+Default batch sizes are data-source specific. Azure SQL Database and Azure Cosmos DB have a default batch size of 1,000. In contrast, Azure Blob indexing sets batch size at 10 documents in recognition of the larger average document size.
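For illustration, roughly what setting a custom batch size looks like with the `azure-search-documents` Python SDK's indexer models (the article itself describes the REST `batchSize` parameter; the names below are placeholders and the exact model fields are worth verifying against the SDK reference):

```python
# Illustrative sketch only: create an indexer with a custom batch size.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import IndexingParameters, SearchIndexer

indexer_client = SearchIndexerClient(
    endpoint="https://<service-name>.search.windows.net",  # placeholder
    credential=AzureKeyCredential("<admin-api-key>"),      # placeholder
)

indexer = SearchIndexer(
    name="blob-indexer",                 # placeholder indexer name
    data_source_name="blob-datasource",  # placeholder, assumed to exist already
    target_index_name="<index-name>",
    parameters=IndexingParameters(batch_size=50),  # for example, smaller batches for large blobs
)
indexer_client.create_or_update_indexer(indexer)
```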
 
 ### Schedule indexers for long-running processes
 
 Indexer scheduling is an important mechanism for processing large data sets and for accommodating slow-running processes like image analysis in an enrichment pipeline.
 
-Typically, indexer processing runs within a 2-hour window. If the indexing workload takes days rather than hours to complete, you can put the indexer on a consecutive, recurring schedule that starts every two hours. Assuming the data source has [change tracking enabled](search-howto-create-indexers.md#change-detection-and-internal-state), the indexer will resume processing where it last left off. At this cadence, an indexer can work its way through a document backlog over a series of days until all unprocessed documents are processed.
+Typically, indexer processing runs within a two-hour window. If the indexing workload takes days rather than hours to complete, you can put the indexer on a consecutive, recurring schedule that starts every two hours. Assuming the data source has [change tracking enabled](search-howto-create-indexers.md#change-detection-and-internal-state), the indexer resumes processing where it last left off. At this cadence, an indexer can work its way through a document backlog over a series of days until all unprocessed documents are processed.
 
 ```json
 {
@@ -92,12 +92,12 @@ Typically, indexer processing runs within a 2-hour window. If the indexing workl
 }
 ```
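The same two-hour cadence expressed through the Python SDK, as a hedged sketch that reuses the `indexer` and `indexer_client` placeholders from the batch-size sketch above (model name and field as I recall them; verify against the SDK reference):

```python
# Illustrative sketch only: run the indexer every two hours so a backlog is worked off across runs.
from datetime import timedelta

from azure.search.documents.indexes.models import IndexingSchedule

indexer.schedule = IndexingSchedule(interval=timedelta(hours=2))
indexer_client.create_or_update_indexer(indexer)
```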
 
-When there are no longer any new or updated documents in the data source, indexer execution history will report `0/0` documents processed, and no processing occurs.
+When there are no longer any new or updated documents in the data source, indexer execution history reports `0/0` documents processed, and no processing occurs.
 
-For more information about setting schedules, see [Create Indexer REST API](/rest/api/searchservice/indexers/create) or see [How to schedule indexers for Azure AI Search](search-howto-schedule-indexers.md).
+For more information about setting schedules, see [Create Indexer REST API](/rest/api/searchservice/indexers/create) or see [Schedule indexers for Azure AI Search](search-howto-schedule-indexers.md).
 
 > [!NOTE]
-> Some indexers that run on an older runtime architecture have a 24-hour rather than 2-hour maximum processing window. The 2-hour limit is for newer content processors that run in an [internally managed multi-tenant environment](search-indexer-securing-resources.md#indexer-execution-environment). Whenever possible, Azure AI Search tries to offload indexer and skillset processing to the multi-tenant environment. If the indexer can't be migrated, it will run in the private environment and it can run for as long as 24 hours. If you're scheduling an indexer that exhibits these characteristics, assume a 24 hour processing window.
+> Some indexers that run on an older runtime architecture have a 24-hour rather than 2-hour maximum processing window. The two-hour limit is for newer content processors that run in an [internally managed multi-tenant environment](search-indexer-securing-resources.md#indexer-execution-environment). Whenever possible, Azure AI Search tries to offload indexer and skillset processing to the multi-tenant environment. If the indexer can't be migrated, it runs in the private environment and it can run for as long as 24 hours. If you're scheduling an indexer that exhibits these characteristics, assume a 24-hour processing window.
 
 <a name="parallel-indexing"></a>
 
@@ -109,9 +109,9 @@ Make sure you have sufficient capacity. One search unit in your service can run
 
 The number of indexing jobs that can run simultaneously varies for text-based and skills-based indexing. For more information, see [Indexer execution](search-howto-run-reset-indexers.md#indexer-execution).
 
-If your data source is an [Azure Blob Storage container](/azure/storage/blobs/storage-blobs-introduction#containers) or [Azure Data Lake Storage Gen 2](/azure/storage/blobs/storage-blobs-introduction#about-azure-data-lake-storage-gen2), enumerating a large number of blobs can take a long time (even hours) until this operation is completed. This will cause that your indexer's documents succeeded count isn't increased during that time and it may seem it's not making any progress, when it is. If you would like document processing to go faster for a large number of blobs, consider partitioning your data into multiple containers and create parallel indexers pointing to a single index.
+If your data source is an [Azure Blob Storage container](/azure/storage/blobs/storage-blobs-introduction#containers) or [Azure Data Lake Storage Gen 2](/azure/storage/blobs/storage-blobs-introduction#about-azure-data-lake-storage-gen2), enumerating a large number of blobs can take a long time (even hours) until this operation is completed. As a result, your indexer's *documents succeeded* count doesn't appear to increase during that time and it might seem it's not making any progress, when it is. If you would like document processing to go faster for a large number of blobs, consider partitioning your data into multiple containers and create parallel indexers pointing to a single index.
 
-1. Sign in to the [Azure portal](https://portal.azure.com) and check the number of search units used by your search service. Select **Settings** > **Scale** to view the number at the top of the page. The number of indexers that will run in parallel is approximately equal to the number of search units.
+1. Sign in to the [Azure portal](https://portal.azure.com) and check the number of search units used by your search service. Select **Settings** > **Scale** to view the number at the top of the page. The number of indexers that run in parallel is approximately equal to the number of search units.
 
 1. Partition source data among multiple containers or multiple virtual folders inside the same container.
 
@@ -123,21 +123,21 @@ If your data source is an [Azure Blob Storage container](/azure/storage/blobs/st
 
 1. Review indexer status and execution history for confirmation.
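As a sketch of the partition-and-parallelize pattern described in these steps (reusing `indexer_client` and `SearchIndexer` from the earlier batch-size sketch; the data source names are placeholders and are assumed to already exist, one per container or folder):

```python
# Illustrative sketch only: one indexer per partitioned data source, all writing to the same index.
data_sources = ("blob-datasource-1", "blob-datasource-2")  # placeholder, pre-created data sources

for ds_name in data_sources:
    indexer_client.create_or_update_indexer(
        SearchIndexer(
            name=f"indexer-{ds_name}",
            data_source_name=ds_name,
            target_index_name="<index-name>",  # same target index for every partition
        )
    )

# Check progress for each indexer; creating an indexer typically triggers an initial run.
for ds_name in data_sources:
    status = indexer_client.get_indexer_status(f"indexer-{ds_name}")
    last = status.last_result.status if status.last_result else "not started"
    print(ds_name, status.status, last)
```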
 
-There are some risks associated with parallel indexing. First, recall that indexing doesn't run in the background, increasing the likelihood that queries will be throttled or dropped.
+There are some risks associated with parallel indexing. First, recall that indexing doesn't run in the background, increasing the likelihood that queries are throttled or dropped.
 
 Second, Azure AI Search doesn't lock the index for updates. Concurrent writes are managed, invoking a retry if a particular write doesn't succeed on first attempt, but you might notice an increase in indexing failures.
 
-Although multiple indexer-data-source sets can target the same index, be careful of indexer runs that can overwrite existing values in the index. If a second indexer-data-source targets the same documents and fields, any values from the first run will be overwritten. Field values are replaced in full; an indexer can't merge values from multiple runs into the same field.
+Although multiple indexer-data-source sets can target the same index, be careful of indexer runs that can overwrite existing values in the index. If a second indexer-data-source targets the same documents and fields, any values from the first run are overwritten. Field values are replaced in full; an indexer can't merge values from multiple runs into the same field.
 
 ## Index big data on Spark
 
 If you have a big data architecture and your data is on a Spark cluster, we recommend [SynapseML for loading and indexing data](search-synapseml-cognitive-services.md). The tutorial includes steps for calling Azure AI services for AI enrichment, but you can also use the AzureSearchWriter API for text indexing.
 
-## See also
+## Related content
 
-+ [Tutorial: Optimize indexing workloads](tutorial-optimize-indexing-push-api.md)
-+ [Tutorial: Index at scale using SynapseML and Apache Spark](search-synapseml-cognitive-services.md)
-+ [Tips for improving performance](search-performance-tips.md)
-+ [Performance analysis](search-performance-analysis.md)
-+ [Indexer overview](search-indexer-overview.md)
-+ [Monitor indexer status](search-howto-monitor-indexers.md)
++ [Tutorial: Optimize indexing by using the push API](tutorial-optimize-indexing-push-api.md)
++ [Tutorial: Index large data from Apache Spark using SynapseML and Azure AI Search](search-synapseml-cognitive-services.md)
++ [Tips for better performance in Azure AI Search](search-performance-tips.md)
++ [Analyze performance in Azure AI Search](search-performance-analysis.md)
++ [Indexers in Azure AI Search](search-indexer-overview.md)
++ [Monitor indexer status and results in Azure AI Search](search-howto-monitor-indexers.md)
