Commit f2d0f77

removed H2s from blob indexing
1 parent 0384e13 commit f2d0f77

3 files changed: 16 additions & 39 deletions

articles/search/search-howto-index-azure-data-lake-storage.md

Lines changed: 0 additions & 12 deletions
@@ -306,18 +306,6 @@ Add the following metadata properties and values to blobs in Blob Storage. When
 | `AzureSearch_Skip` |`"true"` |Instructs the blob indexer to completely skip the blob. Neither metadata nor content extraction is attempted. This is useful when a particular blob fails repeatedly and interrupts the indexing process. |
 | `AzureSearch_SkipContent` |`"true"` |This is equivalent to the `"dataToExtract" : "allMetadata"` setting described [above](#PartsOfBlobToIndex) scoped to a particular blob. |
 
-## How to index large datasets
-
-Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to [process the data in parallel](search-howto-large-index.md#parallel-indexing).
-
-1. Partition your data into multiple blob containers or virtual folders.
-
-1. Set up several data sources, one per container or folder. Use the "query" parameter to specify the partition: `"container" : { "name" : "my-container", "query" : "my-folder" }`.
-
-1. Create one indexer for each data source. Point them to the same target index.
-
-Make sure you have sufficient capacity. One search unit in your service can run one indexer at any given time. Creating multiple indexers is only useful if they can run in parallel.
-
 <a name="DealingWithErrors"></a>
 
 ## Handle errors

articles/search/search-howto-indexing-azure-blob-storage.md

Lines changed: 0 additions & 12 deletions
@@ -300,18 +300,6 @@ Add the following metadata properties and values to blobs in Blob Storage. When
 | "AzureSearch_Skip" |`"true"` |Instructs the blob indexer to completely skip the blob. Neither metadata nor content extraction is attempted. This is useful when a particular blob fails repeatedly and interrupts the indexing process. |
 | "AzureSearch_SkipContent" |`"true"` |This is equivalent to the `"dataToExtract" : "allMetadata"` setting described [above](#PartsOfBlobToIndex) scoped to a particular blob. |
 
-## How to index large datasets
-
-Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to [process the data in parallel](search-howto-large-index.md#parallel-indexing).
-
-1. Partition your data into multiple blob containers or virtual folders.
-
-1. Set up several data sources, one per container or folder. Use the "query" parameter to specify the partition: `"container" : { "name" : "my-container", "query" : "my-folder" }`.
-
-1. Create one indexer for each data source. Point them to the same target index.
-
-Make sure you have sufficient capacity. One search unit in your service can run one indexer at any given time. Creating multiple indexers is only useful if they can run in parallel.
-
 <a name="DealingWithErrors"></a>
 
 ## Handle errors

articles/search/search-howto-large-index.md

Lines changed: 16 additions & 15 deletions
@@ -25,10 +25,7 @@ For a C# tutorial and code sample, see [Tutorial: Optimize indexing speeds](tuto
 
 ## Indexing large datasets with the "push" API
 
-When pushing large data volumes into an index using the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method (.NET)](/dotnet/api/azure.search.documents.searchclient.indexdocuments), batching documents and managing threads are two techniques that improve indexing speed.
-
-+ [Batch multiple documents per request](#batch-multiple-documents-per-request)
-+ [Manage threads](#add-threads-and-a-retry-strategy)
+When pushing large data volumes into an index using the [Add Documents REST API](/rest/api/searchservice/addupdate-or-delete-documents) or the [IndexDocuments method (Azure SDK for .NET)](/dotnet/api/azure.search.documents.searchclient.indexdocuments), batching documents and managing threads are two techniques that improve indexing speed.
 
 ### Batch multiple documents per request

@@ -39,11 +36,11 @@ Using batches to index documents will significantly improve indexing performance
 + The schema of your index
 + The size of your data
 
-Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine what results in the fastest indexing speeds for your scenario. [Tutorial: Optimize indexing with the push API](tutorial-optimize-indexing-push-api.md) provides sample code for testing batch sizes using the .NET SDK.
+Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine which one results in the fastest indexing speeds for your scenario. [Tutorial: Optimize indexing with the push API](tutorial-optimize-indexing-push-api.md) provides sample code for testing batch sizes using the .NET SDK.
 
 ### Add threads and a retry strategy
 
-In contrast with indexer APIs, when you're using the push APIs to index documents, your application code should ensure there are sufficient threads to make full use of the available capacity.
+Indexers have built-in thread management, but when you're using the push APIs, your application code will have to manage threads. Make sure there are sufficient threads to make full use of the available capacity.
 
 1. [Increase the number of threads](tutorial-optimize-indexing-push-api.md#use-multiple-threadsworkers) in your client code. As you increase the tier of your search service or increase the partitions, you should also increase the number of concurrent threads so that you can take full advantage of the new capacity.
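
The batch-size testing described in this hunk can be illustrated with a short Python sketch. It is not taken from the linked tutorial (which uses the .NET SDK): `upload_batch` is a hypothetical callable standing in for whatever push call your client makes, and `chunked` and `time_batch_sizes` are illustrative names. The sketch only splits documents into batches and times each candidate size.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(docs: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive batches of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(docs), batch_size):
        yield list(docs[start:start + batch_size])


def time_batch_sizes(
    docs: Sequence[dict],
    sizes: Sequence[int],
    upload_batch: Callable[[List[dict]], None],
) -> Dict[int, float]:
    """Return {batch_size: elapsed_seconds} for pushing all docs at each size.

    upload_batch is a hypothetical placeholder for the real push call.
    """
    timings = {}
    for size in sizes:
        started = time.perf_counter()
        for batch in chunked(docs, size):
            upload_batch(batch)
        timings[size] = time.perf_counter() - started
    return timings
```

In a real test you would pass a function that sends each batch to the service and compare the timings across sizes. Results depend on the index schema and document size, as the surrounding text notes.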

@@ -55,11 +52,11 @@ In contrast with indexer APIs, when you're using the push APIs to index document
 
 1. To handle failures, requests should be retried using an [exponential backoff retry strategy](/dotnet/architecture/microservices/implement-resilient-applications/implement-retries-exponential-backoff).
 
-The Azure .NET SDK automatically retries 503s and other failed requests but you'll need to implement your own logic to retry 207s. Open-source tools such as [Polly](https://github.com/App-vNext/Polly) can also be used to implement a retry strategy.
+The Azure .NET SDK automatically retries 503s and other failed requests, but you'll need to implement your own logic to retry 207s. Open-source tools such as [Polly](https://github.com/App-vNext/Polly) can also be used to implement a retry strategy.
 
 ## Indexing large datasets with indexers and the "pull" APIs
 
-[Indexers](search-indexer-overview.md) connect to [supported data sources](search-indexer-overview.md#supported-data-sources) for indexing searchable content. While not specifically intended for large-scale indexing, several indexer capabilities are particularly useful for accommodating larger data sets:
+[Indexers](search-indexer-overview.md) have built-in capabilities that are particularly useful for accommodating larger data sets:
 
 + Indexer schedules allow you to parcel out indexing at regular intervals so that you can spread it out over time.
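
The retry guidance in this hunk (exponential backoff, plus your own handling of partially failed 207 responses) might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: `send_batch` is a hypothetical callable that returns the documents that failed, which models a 207 response, and `TransientError` is a hypothetical stand-in for a retryable failure such as a 503. It does not use the Azure SDK or Polly.

```python
import random
import time
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as HTTP 503."""


def push_with_retry(
    send_batch: Callable[[List[dict]], List[dict]],
    batch: List[dict],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Push a batch, retrying failed documents with exponential backoff.

    send_batch returns the documents that failed (a 207-style partial
    success) or raises TransientError when the whole request failed.
    Returns the documents still failing after all retries.
    """
    pending = list(batch)
    for attempt in range(max_retries + 1):
        try:
            pending = send_batch(pending)
        except TransientError:
            pass  # whole request failed: keep everything pending
        if not pending:
            return []
        if attempt < max_retries:
            # Delay doubles each attempt, capped, with up to 50% jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return pending
```

Only the documents that failed are resent on each attempt, so a partially successful batch does not repeat work that already succeeded.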

@@ -85,25 +82,29 @@ In practical terms, for index loads spanning several days, you can put the index
 
 ### Run indexers in parallel
 
-If you have partitioned data, you can create indexer-data-source combinations that pull from each data source and write to the same search index. Because each indexer is distinct, you can run them at the same time, populating a search index more quickly than if you ran them sequentially.
+If you partition your data, you can create multiple indexer-data-source combinations that pull from each data source and write to the same search index. Because each indexer is distinct, you can run them at the same time, populating a search index more quickly than if you ran them sequentially.
 
-There are some risks associated with parallel indexing. First, recall that indexing does not run in the background, increasing the likelihood that queries will be throttled or dropped. Second, Azure Cognitive Search does not lock the index for updates. Concurrent writes are managed, invoking a retry if a particular write does not succeed on first attempt, but you might notice an increase in indexing failures.
+Make sure you have sufficient capacity. One search unit in your service can run one indexer at any given time. Creating multiple indexers is only useful if they can run in parallel.
 
 The number of indexing jobs that can run simultaneously varies for text-based and skills-based indexing. For more information, see [Indexer execution](search-howto-run-reset-indexers.md#indexer-execution).
 
-1. For text-based indexing, [sign in to Azure portal](https://portal.azure.com) and check the number of search units used by your search service. Select **Settings** > **Scale** to view the number at the top of the page. The number of indexers that will run in parallel is approximately equal to the number of search units.
+1. [Sign in to Azure portal](https://portal.azure.com) and check the number of search units used by your search service. Select **Settings** > **Scale** to view the number at the top of the page. The number of indexers that will run in parallel is approximately equal to the number of search units.
 
 1. Partition source data among multiple containers or multiple virtual folders inside the same container.
 
-1. Map each partition to its own [data source](/rest/api/searchservice/create-data-source), paired to its own [indexer](/rest/api/searchservice/create-indexer).
+1. Create multiple [data sources](/rest/api/searchservice/create-data-source), one for each partition, paired to its own [indexer](/rest/api/searchservice/create-indexer).
 
 1. Specify the same target search index in each indexer.
 
-1. Schedule the indexers. Review indexer status and execution history for confirmation.
+1. Schedule the indexers.
 
-Although multiple indexer-data-source sets can target the same index, be careful of indexer runs that can overwrite existing values in the index. If a second indexer-data-source targets the same documents and fields, any values from the first run will be overwritten. Field values are replaced in full; an indexer can't merge values from multiple runs into the same field.
+1. Review indexer status and execution history for confirmation.
+
+There are some risks associated with parallel indexing. First, recall that indexing does not run in the background, increasing the likelihood that queries will be throttled or dropped.
 
-If you're pulling from different data source types, a challenge for this scenario lies in designing an index schema that works for all incoming data, and a document key structure that is uniform in the search index. Natively, the values that uniquely identify a document are metadata_storage_path in a blob container and a primary key in a SQL table. You can imagine that one or both sources must be amended to provide key values in a common format, regardless of content origin. For this scenario, you should expect to perform some level of pre-processing to homogenize the data so that it can be pulled into a single index.
+Second, Azure Cognitive Search does not lock the index for updates. Concurrent writes are managed, invoking a retry if a particular write does not succeed on first attempt, but you might notice an increase in indexing failures.
+
+Although multiple indexer-data-source sets can target the same index, be careful of indexer runs that can overwrite existing values in the index. If a second indexer-data-source targets the same documents and fields, any values from the first run will be overwritten. Field values are replaced in full; an indexer can't merge values from multiple runs into the same field.
 
 ## See also
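
The parallel-indexing steps consolidated by this commit (one data source per partition, one indexer per data source, all writing to the same index) can be sketched as request bodies built in Python. This is an illustrative sketch under stated assumptions. The `"container" : { "name", "query" }` shape comes from the removed blob-indexer text, and the other property names follow the Create Data Source and Create Indexer request bodies as I understand them. The function name, the generated resource names, the `PT2H` schedule, and the placeholder connection string are all hypothetical. Nothing is sent to a service.

```python
import json
from typing import Dict, List, Tuple


def build_partition_definitions(
    container: str,
    folders: List[str],
    index_name: str,
    connection_string: str,
    interval: str = "PT2H",  # hypothetical schedule: every two hours
) -> Tuple[List[Dict], List[Dict]]:
    """Build one data source and one indexer definition per virtual folder."""
    data_sources, indexers = [], []
    for i, folder in enumerate(folders):
        ds_name = f"{index_name}-ds-{i}"  # illustrative naming scheme
        data_sources.append({
            "name": ds_name,
            "type": "azureblob",
            "credentials": {"connectionString": connection_string},
            # "query" scopes the data source to a single partition.
            "container": {"name": container, "query": folder},
        })
        indexers.append({
            "name": f"{index_name}-indexer-{i}",
            "dataSourceName": ds_name,
            "targetIndexName": index_name,  # same target for every indexer
            "schedule": {"interval": interval},
        })
    return data_sources, indexers


if __name__ == "__main__":
    sources, idx = build_partition_definitions(
        "my-container", ["my-folder-a", "my-folder-b"],
        "my-index", "<connection-string>",
    )
    print(json.dumps({"dataSources": sources, "indexers": idx}, indent=2))
```

Creating more of these pairs than the service has search units would not increase parallelism, per the capacity note in the hunk.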
