
Commit bb82e65

Merge pull request #114010 from dereklegenzoff/delegenz
Adding tutorial on optimizing indexing and updating guidance on large data sets
2 parents dc333b2 + 4382863 commit bb82e65

File tree

7 files changed: +461, −12 lines


articles/search/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -60,6 +60,8 @@
   href: search-semi-structured-data.md
 - name: Index multiple Azure data sources
   href: tutorial-multiple-data-sources.md
+- name: Index any data
+  href: tutorial-optimize-indexing-push-api.md
 - name: Use AI to create content
   items:
   - name: C#

articles/search/search-howto-large-index.md

Lines changed: 69 additions & 12 deletions
@@ -3,38 +3,89 @@ title: Index large data set using built-in indexers
 titleSuffix: Azure Cognitive Search
 description: Strategies for large data indexing or computationally intensive indexing through batch mode, resourcing, and techniques for scheduled, parallel, and distributed indexing.
 
-manager: nitinme
-author: HeidiSteen
-ms.author: heidist
+manager: liamca
+author: dereklegenzoff
+ms.author: delegenz
 ms.service: cognitive-search
 ms.topic: conceptual
-ms.date: 12/17/2019
+ms.date: 05/05/2020
 ---
 
 # How to index large data sets in Azure Cognitive Search
 
+Azure Cognitive Search supports [two basic approaches](search-what-is-data-import.md) for importing data into a search index: *pushing* your data into the index programmatically, or pointing an [Azure Cognitive Search indexer](search-indexer-overview.md) at a supported data source to *pull* in the data.
+
 As data volumes grow or processing needs change, you might find that simple or default indexing strategies are no longer practical. For Azure Cognitive Search, there are several approaches for accommodating larger data sets, ranging from how you structure a data upload request, to using a source-specific indexer for scheduled and distributed workloads.
 
 The same techniques also apply to long-running processes. In particular, the steps outlined in [parallel indexing](#parallel-indexing) are helpful for computationally intensive indexing, such as image analysis or natural language processing in an [AI enrichment pipeline](cognitive-search-concept-intro.md).
 
-The following sections explore three techniques for indexing large amounts of data.
+The following sections explore techniques for indexing large amounts of data using both the push API and indexers.
+
+## Push API
+
+When you push data into an index, there are several key considerations that affect indexing speed. These factors are outlined in the sections below.
+
+In addition to the information in this article, you can also take advantage of the code samples in the [optimizing indexing speeds tutorial](tutorial-optimize-indexing-push-api.md) to learn more.
+
+### Service tier and number of partitions/replicas
+
+Adding partitions or increasing the tier of your search service will both increase indexing speed.
+
+Adding replicas may also increase indexing speed, but it isn't guaranteed. On the other hand, additional replicas will increase the query volume your search service can handle. Replicas are also a key component for getting an [SLA](https://azure.microsoft.com/support/legal/sla/search/v1_0/).
 
-## Option 1: Pass multiple documents
+Before adding partitions/replicas or upgrading to a higher tier, consider the monetary cost and allocation time. Adding partitions can significantly increase indexing speed, but adding or removing them can take anywhere from 15 minutes to several hours. For more information, see the documentation on [adjusting capacity](search-capacity-planning.md).
 
-One of the simplest mechanisms for indexing a larger data set is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1000 documents in a bulk upload operation. These limits apply whether you are using the [Add Documents REST API](https://docs.microsoft.com/rest/api/searchservice/addupdate-or-delete-documents) or the [Index method](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.documentsoperationsextensions.index?view=azure-dotnet) in the .NET SDK. For either API, you would package 1000 documents in the body of each request.
+### Index schema
 
-Batch indexing is implemented for individual requests using REST or .NET, or through indexers. A few indexers operate under different limits. Specifically, Azure Blob indexing sets batch size at 10 documents in recognition of the larger average document size. For indexers based on the [Create Indexer REST API](https://docs.microsoft.com/rest/api/searchservice/Create-Indexer), you can set the `BatchSize` argument to customize this setting to better match the characteristics of your data.
+The schema of your index plays an important role in indexing speed. Adding fields, and adding additional properties to those fields (such as *searchable*, *facetable*, or *filterable*), both reduce indexing speed.
+
+In general, we recommend adding additional properties to fields only if you intend to use them.
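To make that guidance concrete, here is a hypothetical field definition as it might appear in an index schema: only the attributes the scenario needs are enabled, and everything else is turned off. The field name and attribute choices are illustrative, not from the original article.

```json
{
  "name": "description",
  "type": "Edm.String",
  "searchable": true,
  "filterable": false,
  "facetable": false,
  "sortable": false
}
```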
 
 > [!NOTE]
 > To keep document size down, avoid adding non-queryable data to an index. Images and other binary data are not directly searchable and shouldn't be stored in the index. To integrate non-queryable data into search results, you should define a non-searchable field that stores a URL reference to the resource.
 
-## Option 2: Add resources
+### Batch size
+
+One of the simplest mechanisms for indexing a larger data set is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1000 documents in a bulk upload operation. These limits apply whether you're using the [Add Documents REST API](https://docs.microsoft.com/rest/api/searchservice/addupdate-or-delete-documents) or the [Index method](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.documentsoperationsextensions.index?view=azure-dotnet) in the .NET SDK. For either API, you would package 1000 documents in the body of each request.
+
+Using batches to index documents will significantly improve indexing performance, and determining the optimal batch size for your data is a key component of optimizing indexing speed. The two primary factors influencing the optimal batch size are:
+
++ The schema of your index
++ The size of your data
+
+Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine which results in the fastest indexing speed for your scenario. This [tutorial](tutorial-optimize-indexing-push-api.md) provides sample code for testing batch sizes using the .NET SDK.
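As an illustration of the batching logic (a sketch, not code from the tutorial, which uses the .NET SDK), the following Python splits a document stream into batches that respect both the 1000-document and 16 MB limits described above. The helper name is hypothetical.

```python
import json

MAX_BATCH_DOCS = 1000                 # service limit: documents per request
MAX_BATCH_BYTES = 16 * 1024 * 1024    # service limit: 16 MB payload

def make_batches(documents, max_docs=MAX_BATCH_DOCS, max_bytes=MAX_BATCH_BYTES):
    """Yield lists of documents that stay under both batch limits."""
    batch, batch_bytes = [], 0
    for doc in documents:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        # Close out the current batch if adding this doc would exceed a limit
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch
```

In practice you would experiment with `max_docs` well below 1000 to find the size that indexes fastest for your schema and document sizes.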
+
+### Number of threads/workers
+
+To take full advantage of Azure Cognitive Search's indexing speeds, you'll likely need to use multiple threads to send batch indexing requests concurrently to the service.
 
-Services that are provisioned at one of the [Standard pricing tiers](search-sku-tier.md) often have underutilized capacity for both storage and workloads (queries or indexing), which makes [increasing the partition and replica counts](search-capacity-planning.md) an obvious solution for accommodating larger data sets. For best results, you need both resources: partitions for storage, and replicas for the data ingestion work.
+The optimal number of threads is determined by:
 
-Increasing replicas and partitions are billable events that increase your cost, but unless you are continuously indexing under maximum load, you can add scale for the duration of the indexing process, and then adjust resource levels back downward after indexing is finished.
++ The tier of your search service
++ The number of partitions
++ The size of your batches
++ The schema of your index
 
-## Option 3: Use indexers
+You can modify the sample code in the tutorial and test with different thread counts to determine the optimal thread count for your scenario. However, as long as you have several threads running concurrently, you should be able to take advantage of most of the efficiency gains.
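The multi-threaded pattern can be sketched in Python with a thread pool (the tutorial's sample uses the .NET SDK; this is an illustration only, and `upload_batch` is a hypothetical placeholder for whatever function performs one Add Documents request):

```python
from concurrent.futures import ThreadPoolExecutor

def index_batches_concurrently(batches, upload_batch, max_workers=8):
    """Send batches to the service from several worker threads.

    `upload_batch` is a stand-in for a real upload call; each worker
    thread picks up the next batch as soon as it finishes one.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves batch order in the returned results
        return list(pool.map(upload_batch, batches))
```

Because each request spends most of its time waiting on the network, threads (rather than processes) are usually enough to keep the service busy.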
+
+> [!NOTE]
+> As you increase the tier of your search service or increase the partitions, you should also increase the number of concurrent threads.
+
+As you ramp up the requests hitting the search service, you may encounter [HTTP status codes](https://docs.microsoft.com/rest/api/searchservice/http-status-codes) indicating that a request didn't fully succeed. During indexing, two common HTTP status codes are:
+
++ **503 Service Unavailable** - This error means that the system is under heavy load and your request can't be processed at this time.
++ **207 Multi-Status** - This error means that some documents succeeded, but at least one failed.
+
+### Retry strategy
+
+If a failure happens, requests should be retried using an [exponential backoff retry strategy](https://docs.microsoft.com/dotnet/architecture/microservices/implement-resilient-applications/implement-retries-exponential-backoff).
+
+Azure Cognitive Search's .NET SDK automatically retries 503s and other failed requests, but you'll need to implement your own logic to retry 207s. Open-source tools such as [Polly](https://github.com/App-vNext/Polly) can also be used to implement a retry strategy.
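A minimal sketch of that retry strategy, under the assumption that `send_batch` is a hypothetical function returning `(status_code, failed_docs)` for one upload attempt (a real implementation would call the Add Documents API): on a 207, only the failed documents are resubmitted, and each retry waits exponentially longer, with jitter.

```python
import random
import time

def index_with_backoff(send_batch, batch, max_attempts=5, base_delay=1.0):
    """Retry a batch upload with exponential backoff and jitter."""
    pending = batch
    for attempt in range(max_attempts):
        status, failed = send_batch(pending)
        if status == 200 and not failed:
            return True                    # everything indexed
        if status == 207 and failed:
            pending = failed               # resubmit only the failed documents
        elif status not in (503, 207):
            raise RuntimeError(f"unretryable status: {status}")
        # exponential backoff with jitter before the next attempt
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    return False
```

The jitter spreads retries from concurrent threads apart so they don't all hit the overloaded service at the same instant.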
+
+### Network data transfer speeds
+
+Network data transfer speeds can be a limiting factor when indexing data. Indexing data from within your Azure environment is an easy way to speed up indexing.
+
+## Indexers
 
 [Indexers](search-indexer-overview.md) are used to crawl supported Azure data sources for searchable content. While not specifically intended for large-scale indexing, several indexer capabilities are particularly useful for accommodating larger data sets:
 
@@ -45,6 +96,12 @@ Increasing replicas and partitions are billable events that increase your cost,
 > [!NOTE]
 > Indexers are data-source-specific, so using an indexer approach is only viable for selected data sources on Azure: [SQL Database](search-howto-connecting-azure-sql-database-to-azure-search-using-indexers.md), [Blob storage](search-howto-indexing-azure-blob-storage.md), [Table storage](search-howto-indexing-azure-tables.md), [Cosmos DB](search-howto-index-cosmosdb.md).
 
+### Batch size
+
+As with the push API, indexers allow you to configure the number of items per batch. For indexers based on the [Create Indexer REST API](https://docs.microsoft.com/rest/api/searchservice/Create-Indexer), you can set the `batchSize` argument to customize this setting to better match the characteristics of your data.
+
+Default batch sizes are data-source specific. Azure SQL Database and Azure Cosmos DB have a default batch size of 1000. In contrast, Azure Blob indexing sets batch size at 10 documents in recognition of the larger average document size.
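For illustration, `batchSize` is set inside the `parameters` object of the indexer definition sent to Create Indexer; the resource names below are hypothetical placeholders:

```json
{
  "name": "my-blob-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-index",
  "parameters": {
    "batchSize": 50
  }
}
```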
+
 ### Scheduled indexing
 
 Indexer scheduling is an important mechanism for processing large data sets, as well as slow-running processes like image analysis in a cognitive search pipeline. Indexer processing operates within a 24-hour window. If processing fails to finish within 24 hours, the behaviors of indexer scheduling can work to your advantage.
