
Commit c84f909

Merge pull request #190020 from HeidiSteen/heidist-work
Updates to large-volume indexing docs
2 parents 86fd101 + f2d0f77 commit c84f909

4 files changed: +32 −82 lines changed

articles/search/search-howto-connecting-azure-sql-database-to-azure-search-using-indexers.md

Lines changed: 0 additions & 4 deletions
@@ -358,10 +358,6 @@ Yes. However, you need to allow your search service to connect to your database.
 
 Not directly. We do not recommend or support a direct connection, as doing so would require you to open your databases to Internet traffic. Customers have succeeded with this scenario using bridge technologies like Azure Data Factory. For more information, see [Push data to an Azure Cognitive Search index using Azure Data Factory](../data-factory/v1/data-factory-azure-search-connector.md).
 
-**Q: Does running an indexer affect my query workload?**
-
-Yes. Indexer runs on one of the nodes in your search service, and that node’s resources are shared between indexing and serving query traffic and other API requests. If you run intensive indexing and query workloads and encounter a high rate of 503 errors or increasing response times, consider [scaling up your search service](search-capacity-planning.md).
-
 **Q: Can I use a secondary replica in a [failover cluster](../azure-sql/database/auto-failover-group-overview.md) as a data source?**
 
 It depends. For full indexing of a table or view, you can use a secondary replica.

articles/search/search-howto-index-azure-data-lake-storage.md

Lines changed: 0 additions & 12 deletions
@@ -306,18 +306,6 @@ Add the following metadata properties and values to blobs in Blob Storage. When
 | `AzureSearch_Skip` |`"true"` |Instructs the blob indexer to completely skip the blob. Neither metadata nor content extraction is attempted. This is useful when a particular blob fails repeatedly and interrupts the indexing process. |
 | `AzureSearch_SkipContent` |`"true"` |This is equivalent to the `"dataToExtract" : "allMetadata"` setting described [above](#PartsOfBlobToIndex) scoped to a particular blob. |
 
-## How to index large datasets
-
-Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to [process the data in parallel](search-howto-large-index.md#parallel-indexing).
-
-1. Partition your data into multiple blob containers or virtual folders.
-
-1. Set up several data sources, one per container or folder. Use the "query" parameter to specify the partition: `"container" : { "name" : "my-container", "query" : "my-folder" }`.
-
-1. Create one indexer for each data source. Point them to the same target index.
-
-Make sure you have sufficient capacity. One search unit in your service can run one indexer at any given time. Creating multiple indexers is only useful if they can run in parallel.
-
 <a name="DealingWithErrors"></a>
 
 ## Handle errors

articles/search/search-howto-indexing-azure-blob-storage.md

Lines changed: 0 additions & 12 deletions
@@ -300,18 +300,6 @@ Add the following metadata properties and values to blobs in Blob Storage. When
 | "AzureSearch_Skip" |`"true"` |Instructs the blob indexer to completely skip the blob. Neither metadata nor content extraction is attempted. This is useful when a particular blob fails repeatedly and interrupts the indexing process. |
 | "AzureSearch_SkipContent" |`"true"` |This is equivalent to the `"dataToExtract" : "allMetadata"` setting described [above](#PartsOfBlobToIndex) scoped to a particular blob. |
 
-## How to index large datasets
-
-Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to [process the data in parallel](search-howto-large-index.md#parallel-indexing).
-
-1. Partition your data into multiple blob containers or virtual folders.
-
-1. Set up several data sources, one per container or folder. Use the "query" parameter to specify the partition: `"container" : { "name" : "my-container", "query" : "my-folder" }`.
-
-1. Create one indexer for each data source. Point them to the same target index.
-
-Make sure you have sufficient capacity. One search unit in your service can run one indexer at any given time. Creating multiple indexers is only useful if they can run in parallel.
-
 <a name="DealingWithErrors"></a>
 
 ## Handle errors

articles/search/search-howto-large-index.md

Lines changed: 32 additions & 54 deletions
@@ -8,7 +8,7 @@ author: dereklegenzoff
 ms.author: delegenz
 ms.service: cognitive-search
 ms.topic: conceptual
-ms.date: 01/20/2022
+ms.date: 02/28/2022
 ---
 
 # Index large data sets in Azure Cognitive Search
@@ -17,43 +17,17 @@ Azure Cognitive Search supports [two basic approaches](search-what-is-data-impor
 
 As data volumes grow or processing needs change, you might find that simple or default indexing strategies are no longer practical. For Azure Cognitive Search, there are several approaches for accommodating larger data sets, ranging from how you structure a data upload request, to using a source-specific indexer for scheduled and distributed workloads.
 
-The same techniques also apply to long-running processes. In particular, the steps outlined in [parallel indexing](#parallel-indexing) are helpful for computationally intensive indexing, such as image analysis or natural language processing in an [AI enrichment pipeline](cognitive-search-concept-intro.md).
+The same techniques also apply to long-running processes. In particular, the steps outlined in [parallel indexing](#run-indexers-in-parallel) are helpful for computationally intensive indexing, such as image analysis or natural language processing in an [AI enrichment pipeline](cognitive-search-concept-intro.md).
 
-The following sections explain techniques for indexing large amounts of data using both the push API and indexers.For more information and code samples that illustrate push model indexing, see [Tutorial: Optimize indexing speeds](tutorial-optimize-indexing-push-api.md).
+The following sections explain techniques for indexing large amounts of data using both the push API and indexers. You should also review [Tips for improving performance](search-performance-tips.md) for more best practices.
 
-## Indexing with the "push" API
+For a C# tutorial and code sample, see [Tutorial: Optimize indexing speeds](tutorial-optimize-indexing-push-api.md).
 
-When pushing data into an index using the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method (.NET)](/dotnet/api/azure.search.documents.searchclient.indexdocuments), there are several key considerations that impact indexing speed. Those factors are outlined in the section below, and range from setting service capacity to code optimizations.
+## Indexing large datasets with the "push" API
 
-+ [Index schema](#review-index-schema)
-+ [Data location and transfer speed](#check-data-location)
-+ [Batch multiple documents per request](#check-the-batch-size)
-+ [Service capacity](#check-service-capacity-and-partitions)
-+ [Manage threads](#add-threads-and-a-retry-strategy)
+When pushing large data volumes into an index using the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method (Azure SDK for .NET)](/dotnet/api/azure.search.documents.searchclient.indexdocuments), batching documents and managing threads are two techniques that improve indexing speed.
 
-## Review index schema
-
-The schema of your index plays an important role in indexing data. The more fields you have, and the more properties you set (such as *searchable*, *facetable*, or *filterable*), all contribute to increased indexing time.
-
-To keep document size down, avoid adding non-queryable data to an index. Every field that you add to an index should be there for a reason. If you need to integrate non-queryable data such as images into search results, you should define a non-searchable field that stores a URL reference to the resource.
-
-## Check data location
-
-Network data transfer speeds can be a limiting factor when indexing data. Indexing data from within your Azure environment is an easy way to speed up indexing.
-
-## Check service capacity and partitions
-
-1. Review the characteristics and [limits](search-limits-quotas-capacity.md) of the tier at which you provisioned the service. Service tiers differ by the size and speed of partitions, which has a direct impact on indexing speed. If the tier is insufficient for the workload, upgrading might be the easiest and most effective solution for increasing indexing throughput.
-
-1. [Increase the number of partitions](search-capacity-planning.md#add-or-reduce-replicas-and-partitions), even if only on a temporary basis. Partition allocation can be readjusted downwards after an initial indexing run to reduce the overall cost of running the service.
-
-   Adding more replicas may also increase indexing speeds but it isn't guaranteed. On the other hand, additional replicas will increase the query volume your search service can handle. Because indexing does not run in the background, increasing query capacity should help overall performance.
-
-> [!NOTE]
-> When [adding partition and replicas](search-capacity-planning.md#add-or-reduce-replicas-and-partitions), or provisioning a service at a higher tier, consider the monetary cost and allocation time. Adding partitions can significantly increase indexing speed, but adding and removing them can take anywhere from 15 minutes to several hours.
->
-
-## Check the batch size
+### Batch multiple documents per request
 
 One of the simplest mechanisms for indexing a larger data set is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1000 documents in a bulk upload operation. These limits apply whether you're using the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method](/dotnet/api/azure.search.documents.searchclient.indexdocuments) in the .NET SDK. For either API, you would package 1000 documents in the body of each request.
 
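To make the batching guidance in the hunk above concrete, here is a minimal sketch using the `Azure.Search.Documents` SDK for .NET. The `Hotel` type, client, and field names are hypothetical placeholders; a real index schema defines the actual document shape.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Hypothetical document type; your index schema defines the real shape.
public class Hotel
{
    public string HotelId { get; set; }
    public string HotelName { get; set; }
}

public static class BatchUpload
{
    public static void UploadInBatches(SearchClient client, IReadOnlyList<Hotel> docs)
    {
        // Up to 1000 documents per request, provided the payload stays under 16 MB.
        const int batchSize = 1000;

        for (int i = 0; i < docs.Count; i += batchSize)
        {
            var batch = IndexDocumentsBatch.Upload(docs.Skip(i).Take(batchSize));
            IndexDocumentsResult result = client.IndexDocuments(batch);

            // Partial failures are reported per document rather than thrown.
            foreach (var failed in result.Results.Where(x => !x.Succeeded))
            {
                Console.WriteLine($"Failed: {failed.Key} ({failed.Status}) {failed.ErrorMessage}");
            }
        }
    }
}
```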
@@ -62,11 +36,11 @@ Using batches to index documents will significantly improve indexing performance
 + The schema of your index
 + The size of your data
 
-Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine what results in the fastest indexing speeds for your scenario. [Tutorial: Optimize indexing with the push API](tutorial-optimize-indexing-push-api.md) provides sample code for testing batch sizes using the .NET SDK.
+Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine which one results in the fastest indexing speeds for your scenario. [Tutorial: Optimize indexing with the push API](tutorial-optimize-indexing-push-api.md) provides sample code for testing batch sizes using the .NET SDK.
 
-## Add threads and a retry strategy
+### Add threads and a retry strategy
 
-In contrast with indexer APIs, when you are using the push APIs to index documents, your application code should ensure there are sufficient threads to make full use of the available capacity.
+Indexers have built-in thread management, but when you're using the push APIs, your application code will have to manage threads. Make sure there are sufficient threads to make full use of the available capacity.
 
 1. [Increase the number of threads](tutorial-optimize-indexing-push-api.md#use-multiple-threadsworkers) in your client code. As you increase the tier of your search service or increase the partitions, you should also increase the number of concurrent threads so that you can take full advantage of the new capacity.
 
@@ -78,25 +52,25 @@ In contrast with indexer APIs, when you are using the push APIs to index documen
 
 1. To handle failures, requests should be retried using an [exponential backoff retry strategy](/dotnet/architecture/microservices/implement-resilient-applications/implement-retries-exponential-backoff).
 
-The Azure .NET SDK automatically retries 503s and other failed requests but you'll need to implement your own logic to retry 207s. Open-source tools such as [Polly](https://github.com/App-vNext/Polly) can also be used to implement a retry strategy.
+The Azure .NET SDK automatically retries 503s and other failed requests, but you'll need to implement your own logic to retry 207s. Open-source tools such as [Polly](https://github.com/App-vNext/Polly) can also be used to implement a retry strategy.
 
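As a sketch of the thread and retry guidance above: the snippet below fans batches across several worker tasks and adds backoff-based handling for partial (207) failures, since the SDK already retries 503s on its own. The worker count and delays are illustrative values, not tuned recommendations.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

public static class ParallelUpload
{
    // Fan batches out across several worker tasks, retrying with exponential backoff.
    public static async Task UploadAsync<T>(
        SearchClient client, IEnumerable<IndexDocumentsBatch<T>> batches, int workers = 8)
    {
        var queue = new ConcurrentQueue<IndexDocumentsBatch<T>>(batches);

        async Task WorkerAsync()
        {
            while (queue.TryDequeue(out var batch))
            {
                for (int attempt = 1; attempt <= 5; attempt++)
                {
                    try
                    {
                        var result = (await client.IndexDocumentsAsync(batch)).Value;

                        // HTTP 207: the request succeeded but some documents failed.
                        // Upload actions overwrite whole documents, so resubmitting the
                        // batch is safe; a real client would resend only the failures.
                        if (result.Results.All(x => x.Succeeded))
                            break;
                    }
                    catch (RequestFailedException) when (attempt < 5)
                    {
                        // 503s and other transport-level failures land here.
                    }

                    // Exponential backoff: 2, 4, 8, 16 seconds (illustrative values).
                    await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
                }
            }
        }

        await Task.WhenAll(Enumerable.Range(0, workers).Select(_ => WorkerAsync()));
    }
}
```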
-## Indexer-based "pull" indexing
+## Indexing large datasets with indexers and the "pull" APIs
 
-[Indexers](search-indexer-overview.md) crawl [supported data sources](search-indexer-overview.md#supported-data-sources) for searchable content. While not specifically intended for large-scale indexing, several indexer capabilities are particularly useful for accommodating larger data sets:
+[Indexers](search-indexer-overview.md) have built-in capabilities that are particularly useful for accommodating larger data sets:
 
-+ Schedules allow you to parcel out indexing at regular intervals so that you can spread it out over time.
++ Indexer schedules allow you to parcel out indexing at regular intervals so that you can spread it out over time.
 
-+ Scheduled indexing can resume at the last known stopping point. If a data source is not fully crawled within a 24-hour window, the indexer will resume indexing on day two at wherever it left off.
++ Scheduled indexing can resume at the last known stopping point. If a data source isn't fully scanned within a 24-hour window, the indexer will resume indexing on day two at wherever it left off.
 
 + Partitioning data into smaller individual data sources enables parallel processing. You can break up source data into smaller components, such as into multiple containers in Azure Blob Storage, create a [data source](/rest/api/searchservice/create-data-source) for each partition, and then run multiple indexers in parallel.
 
-### Check indexer batchSize
+### Check indexer batch size
 
 As with the push API, indexers allow you to configure the number of items per batch. For indexers based on the [Create Indexer REST API](/rest/api/searchservice/Create-Indexer), you can set the `batchSize` argument to customize this setting to better match the characteristics of your data.
 
 Default batch sizes are data source specific. Azure SQL Database and Azure Cosmos DB have a default batch size of 1000. In contrast, Azure Blob indexing sets batch size at 10 documents in recognition of the larger average document size.
 
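For illustration, overriding the default batch size through the Azure SDK for .NET might look like the following sketch. The endpoint, key, and resource names are hypothetical placeholders, and the data source and index are assumed to already exist.

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

// Minimal sketch: override the default batch size on an indexer definition.
var indexerClient = new SearchIndexerClient(
    new Uri("https://my-service.search.windows.net"),
    new AzureKeyCredential("<admin-api-key>"));

var indexer = new SearchIndexer(
    name: "hotels-indexer",
    dataSourceName: "hotels-ds",      // hypothetical, must already exist
    targetIndexName: "hotels-index")  // hypothetical, must already exist
{
    Parameters = new IndexingParameters { BatchSize = 500 }
};

indexerClient.CreateOrUpdateIndexer(indexer);
```

A smaller batch size can help when individual documents are large or heavily enriched; a larger one amortizes request overhead across more documents.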
-## Scheduled indexers for long-running processes
+### Schedule indexers for long-running processes
 
 Indexer scheduling is an important mechanism for processing large data sets, and slow-running processes like image analysis in a cognitive search pipeline. Indexer processing operates within a 24-hour window. If processing fails to finish within 24 hours, the behaviors of indexer scheduling can work to your advantage.
 
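A schedule is attached to the same indexer definition. Here is a minimal sketch, assuming the same hypothetical names as the previous snippet; the two-hour interval is illustrative.

```csharp
using System;
using Azure.Search.Documents.Indexes.Models;

// Minimal sketch: run the indexer every two hours, starting now, so that
// repeated runs resume from the last checkpoint until the data set is done.
var indexer = new SearchIndexer("hotels-indexer", "hotels-ds", "hotels-index")
{
    Schedule = new IndexingSchedule(TimeSpan.FromHours(2))
    {
        StartTime = DateTimeOffset.UtcNow
    }
};
// Apply with a SearchIndexerClient, as in the previous sketch:
// indexerClient.CreateOrUpdateIndexer(indexer);
```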
@@ -106,31 +80,35 @@ In practical terms, for index loads spanning several days, you can put the index
 
 <a name="parallel-indexing"></a>
 
-## Parallel indexers
+### Run indexers in parallel
 
-If you have partitioned data, you can create indexer-data-source combinations that pull from each data source and write to the same search index. Because each indexer is distinct, you can run them at the same time, populating a search index more quickly than if you ran them sequentially.
+If you partition your data, you can create multiple indexer-data-source combinations that pull from each data source and write to the same search index. Because each indexer is distinct, you can run them at the same time, populating a search index more quickly than if you ran them sequentially.
 
-There are some risks associated with parallel indexing. First, recall that indexing does not run in the background, increasing the likelihood that queries will be throttled or dropped. Second, Azure Cognitive Search does not lock the index for updates. Concurrent writes are managed, invoking a retry if a particular write does not succeed on first attempt, but you might notice an increase in indexing failures.
+Make sure you have sufficient capacity. One search unit in your service can run one indexer at any given time. Creating multiple indexers is only useful if they can run in parallel.
 
 The number of indexing jobs that can run simultaneously varies for text-based and skills-based indexing. For more information, see [Indexer execution](search-howto-run-reset-indexers.md#indexer-execution).
 
-1. For text-based indexing, [sign in to Azure portal](https://portal.azure.com) and check the number of search units used by your search service. Select **Settings** > **Scale** to view the number at the top of the page. The number of indexers that will run in parallel is approximately equal to the number of search units.
+1. [Sign in to Azure portal](https://portal.azure.com) and check the number of search units used by your search service. Select **Settings** > **Scale** to view the number at the top of the page. The number of indexers that will run in parallel is approximately equal to the number of search units.
 
 1. Partition source data among multiple containers or multiple virtual folders inside the same container.
 
-1. Map each partition to its own [data source](/rest/api/searchservice/create-data-source), paired to its own [indexer](/rest/api/searchservice/create-indexer).
+1. Create multiple [data sources](/rest/api/searchservice/create-data-source), one for each partition, paired to its own [indexer](/rest/api/searchservice/create-indexer).
 
 1. Specify the same target search index in each indexer.
 
-1. Schedule the indexers. Review indexer status and execution history for confirmation.
+1. Schedule the indexers.
+
+1. Review indexer status and execution history for confirmation.
 
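Putting the steps above together, a sketch with the Azure SDK for .NET might look like this. The endpoint, key, container, and folder names are hypothetical placeholders, and sufficient capacity (roughly one search unit per concurrently running indexer) is assumed to be in place.

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

// Minimal sketch: one data source + indexer pair per partition, all writing
// to the same target index.
var client = new SearchIndexerClient(
    new Uri("https://my-service.search.windows.net"),
    new AzureKeyCredential("<admin-api-key>"));

string[] folders = { "folder-1", "folder-2", "folder-3" };

foreach (var folder in folders)
{
    // Scope each data source to one virtual folder via the container query.
    var dataSource = new SearchIndexerDataSourceConnection(
        name: $"blobs-{folder}",
        type: SearchIndexerDataSourceType.AzureBlob,
        connectionString: "<storage-connection-string>",
        container: new SearchIndexerDataContainer("my-container") { Query = folder });
    client.CreateOrUpdateDataSourceConnection(dataSource);

    // Each indexer is distinct, so the jobs can run at the same time,
    // but they all target the same search index.
    var indexer = new SearchIndexer($"indexer-{folder}", dataSource.Name, "my-index");
    client.CreateOrUpdateIndexer(indexer);
}
```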
-Although multiple indexer-data-source sets can target the same index, be careful of indexer runs that can overwrite existing values in the index. If a second indexer-data-source targets the same documents and fields, any values from the first run will be overwritten. Field values are replaced in full; an indexer cannot merge values from multiple runs into the same field.
+There are some risks associated with parallel indexing. First, recall that indexing does not run in the background, increasing the likelihood that queries will be throttled or dropped.
 
-If you are pulling from different data source types, a challenge for this scenario lies in designing an index schema that works for all incoming data, and a document key structure that is uniform in the search index. Natively, the values that uniquely identify a document are metadata_storage_path in a blob container and a primary key in a SQL table. You can imagine that one or both sources must be amended to provide key values in a common format, regardless of content origin. For this scenario, you should expect to perform some level of pre-processing to homogenize the data so that it can be pulled into a single index.
+Second, Azure Cognitive Search does not lock the index for updates. Concurrent writes are managed, invoking a retry if a particular write does not succeed on first attempt, but you might notice an increase in indexing failures.
+
+Although multiple indexer-data-source sets can target the same index, be careful of indexer runs that can overwrite existing values in the index. If a second indexer-data-source targets the same documents and fields, any values from the first run will be overwritten. Field values are replaced in full; an indexer can't merge values from multiple runs into the same field.
 
 ## See also
 
-+ [Indexer overview](search-indexer-overview.md)
-+ [Create an indexer](search-howto-create-indexers.md)
-+ [Monitor indexer status](search-howto-monitor-indexers.md)
++ [Tips for improving performance](search-performance-tips.md)
 + [Performance analysis](search-performance-analysis.md)
++ [Indexer overview](search-indexer-overview.md)
++ [Monitor indexer status](search-howto-monitor-indexers.md)
