
Commit 16ea0ce

Merge pull request #266546 from HeidiSteen/heidist-fresh
[azure search] Feb freshness pass #1
2 parents 7e95504 + e5e397a commit 16ea0ce

15 files changed (+131, -115 lines)

articles/search/cognitive-search-incremental-indexing-conceptual.md

Lines changed: 20 additions & 20 deletions
@@ -8,27 +8,32 @@ ms.service: cognitive-search
 ms.custom:
 - ignite-2023
 ms.topic: conceptual
-ms.date: 04/21/2023
+ms.date: 02/16/2024
 ---

 # Incremental enrichment and caching in Azure AI Search

 > [!IMPORTANT]
 > This feature is in public preview under [supplemental terms of use](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). The [preview REST API](/rest/api/searchservice/index-preview) supports this feature.

-*Incremental enrichment* refers to the use of cached enrichments during [skillset execution](cognitive-search-working-with-skillsets.md) so that only new and changed skills and documents incur AI processing. The cache contains the output from [document cracking](search-indexer-overview.md#document-cracking), plus the outputs of each skill for every document. Although caching is billable (it uses Azure Storage), the overall cost of enrichment is reduced because the costs of storage are less than image extraction and AI processing.
+*Incremental enrichment* refers to the use of cached enrichments during [skillset execution](cognitive-search-working-with-skillsets.md) so that only new and changed skills and documents incur AI processing charges. The cache contains the output from [document cracking](search-indexer-overview.md#document-cracking), plus the outputs of each skill for every document. Although caching is billable (it uses Azure Storage), the overall cost of enrichment is reduced because the costs of storage are less than image extraction and AI processing.

 When you enable caching, the indexer evaluates your updates to determine whether existing enrichments can be pulled from the cache. Image and text content from the document cracking phase, plus skill outputs that are upstream or orthogonal to your edits, are likely to be reusable.

-After performing the incremental enrichments as indicated by the skillset update, refreshed results are written back to the cache, and also to the search index or knowledge store.
+After skillset processing is finished, the refreshed results are written back to the cache, and also to the search index or knowledge store.
+
+## Limitations
+
+> [!CAUTION]
+> If you're using the [SharePoint Online indexer (Preview)](search-howto-index-sharepoint-online.md), you should avoid incremental enrichment. Under certain circumstances, the cache becomes invalid, requiring an [indexer reset and run](search-howto-run-reset-indexers.md), should you choose to reload it.

 ## Cache configuration

-Physically, the cache is stored in a blob container in your Azure Storage account, one per indexer. Each indexer is assigned a unique and immutable cache identifier that corresponds to the container it is using.
+Physically, the cache is stored in a blob container in your Azure Storage account, one per indexer. Each indexer is assigned a unique and immutable cache identifier that corresponds to the container it's using.

-The cache is created when you specify the "cache" property and run the indexer. Only enriched content can be cached. If your indexer does not have an attached skillset, then caching does not apply.
+The cache is created when you specify the "cache" property and run the indexer. Only enriched content can be cached. If your indexer doesn't have an attached skillset, then caching doesn't apply.

-The following example illustrates an indexer with caching enabled. See [Enable enrichment caching](search-howto-incremental-index.md) for full instructions. Notice that when adding the cache property, use preview API version, 2020-06-30-Preview or later, on the request.
+The following example illustrates an indexer with caching enabled. See [Enable enrichment caching](search-howto-incremental-index.md) for full instructions. Notice that when adding the cache property, use a [preview API version](/rest/api/searchservice/search-service-api-versions#preview-versions), 2020-06-30-Preview or later, on the request.

 ```json
 POST https://[search service name].search.windows.net/indexers?api-version=2020-06-30-Preview
@@ -49,7 +54,7 @@ POST https://[search service name].search.windows.net/indexers?api-version=2020-

 ## Cache management

-The lifecycle of the cache is managed by the indexer. If an indexer is deleted, its cache is also deleted. If the "cache" property on the indexer is set to null or the connection string is changed, the existing cache is deleted on the next indexer run.
+The lifecycle of the cache is managed by the indexer. If an indexer is deleted, its cache is also deleted. If the `cache` property on the indexer is set to null or the connection string is changed, the existing cache is deleted on the next indexer run.

 While incremental enrichment is designed to detect and respond to changes with no intervention on your part, there are parameters you can use to invoke specific behaviors:

@@ -62,17 +67,17 @@ While incremental enrichment is designed to detect and respond to changes with n

 ### Prioritize new documents

-The "cache" property includes an "enableReprocessing" parameter. It's used to control processing over incoming documents already represented in the cache. When true (default), documents already in the cache are reprocessed when you rerun the indexer, assuming your skill update affects that doc.
+The cache property includes an `enableReprocessing` parameter. It's used to control processing over incoming documents already represented in the cache. When true (default), documents already in the cache are reprocessed when you rerun the indexer, assuming your skill update affects that doc.

-When false, existing documents are not reprocessed, effectively prioritizing new, incoming content over existing content. You should only set "enableReprocessing" to false on a temporary basis. Having "enableReprocessing" set to true most of the time ensures that all documents, both new and existing, are valid per the current skillset definition.
+When false, existing documents aren't reprocessed, effectively prioritizing new, incoming content over existing content. You should only set enableReprocessing to false on a temporary basis. Having enableReprocessing set to true most of the time ensures that all documents, both new and existing, are valid per the current skillset definition.

 <a name="Bypass-skillset-checks"></a>

 ### Bypass skillset evaluation

-Modifying a skill and reprocessing of that skill typically go hand in hand. However, some changes to a skill should not result in reprocessing (for example, deploying a custom skill to a new location or with a new access key). Most likely, these are peripheral modifications that have no genuine impact on the substance of the skill output itself.
+Modifying a skill and reprocessing of that skill typically go hand in hand. However, some changes to a skill shouldn't result in reprocessing (for example, deploying a custom skill to a new location or with a new access key). Most likely, these are peripheral modifications that have no genuine impact on the substance of the skill output itself.

-If you know that a change to the skill is indeed superficial, you should override skill evaluation by setting the "disableCacheReprocessingChangeDetection" parameter to true:
+If you know that a change to the skill is indeed superficial, you should override skill evaluation by setting the `disableCacheReprocessingChangeDetection` parameter to true:

 1. Call [Update Skillset](/rest/api/searchservice/update-skillset) and modify the skillset definition.
 1. Append the "disableCacheReprocessingChangeDetection=true" parameter on the request.
@@ -89,7 +94,7 @@ PUT https://[servicename].search.windows.net/skillsets/[skillset name]?api-versi

 ### Bypass data source validation checks

-Most changes to a data source definition will invalidate the cache. However, for scenarios where you know that a change should not invalidate the cache - such as changing a connection string or rotating the key on the storage account - append the "ignoreResetRequirement" parameter on the [data source update](/rest/api/searchservice/update-data-source). Setting this parameter to true allows the commit to go through, without triggering a reset condition that would result in all objects being rebuilt and populated from scratch.
+Most changes to a data source definition will invalidate the cache. However, for scenarios where you know that a change shouldn't invalidate the cache - such as changing a connection string or rotating the key on the storage account - append the `ignoreResetRequirement` parameter on the [data source update](/rest/api/searchservice/update-data-source). Setting this parameter to true allows the commit to go through, without triggering a reset condition that would result in all objects being rebuilt and populated from scratch.

 ```http
 PUT https://[search service].search.windows.net/datasources/[data source name]?api-version=2020-06-30-Preview&ignoreResetRequirement
@@ -135,7 +140,7 @@ POST https://[search service name].search.windows.net/indexers/[indexer name]/re

 Once you enable a cache, the indexer evaluates changes in your pipeline composition to determine which content can be reused and which needs reprocessing. This section enumerates changes that invalidate the cache outright, followed by changes that trigger incremental processing.

-An invalidating change is one where the entire cache is no longer valid. An example of an invalidating change is one where your data source is updated. Here is the complete list of changes to any part of the indexer pipeline that would invalidate your cache:
+An invalidating change is one where the entire cache is no longer valid. An example of an invalidating change is one where your data source is updated. Here's the complete list of changes to any part of the indexer pipeline that would invalidate your cache:

 + Changing the data source type
 + Changing data source container
@@ -155,7 +160,7 @@ An invalidating change is one where the entire cache is no longer valid. An exam

 ## Changes that trigger incremental processing

-Incremental processing evaluates your skillset definition and determines which skills to rerun, selectively updating the affected portions of the document tree. Here is the complete list of changes resulting in incremental enrichment:
+Incremental processing evaluates your skillset definition and determines which skills to rerun, selectively updating the affected portions of the document tree. Here's the complete list of changes resulting in incremental enrichment:

 + Changing the skill type (the OData type of the skill is updated)
 + Skill-specific parameters are updated, for example a URL, defaults, or other parameters
@@ -176,12 +181,7 @@ REST API version `2020-06-30-Preview` or later provides incremental enrichment t

 + [Reset Skills (api-version=2020-06-30)](/rest/api/searchservice/preview-api/reset-skills)

-+ [Update Data Source](/rest/api/searchservice/update-data-source), when called with a preview API version, provides a new parameter named "ignoreResetRequirement", which should be set to true when your update action should not invalidate the cache. Use "ignoreResetRequirement" sparingly as it could lead to unintended inconsistency in your data that will not be detected easily.
-
-## Limitations
-
-> [!CAUTION]
-> If you're using the [SharePoint Online indexer (Preview)](search-howto-index-sharepoint-online.md), you should avoid incremental enrichment. Under certain circumstances, the cache becomes invalid, requiring an [indexer reset and run](search-howto-run-reset-indexers.md), should you choose to reload it.
++ [Update Data Source](/rest/api/searchservice/update-data-source), when called with a preview API version, provides a new parameter named "ignoreResetRequirement", which should be set to true when your update action shouldn't invalidate the cache. Use "ignoreResetRequirement" sparingly as it could lead to unintended inconsistency in your data that won't be detected easily.

 ## Next steps

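Reviewer's sketch (not part of the commit): the hunks above describe two opt-out query parameters, `disableCacheReprocessingChangeDetection` on a skillset update and `ignoreResetRequirement` on a data source update. The Python below only shows how such request URLs might be assembled. The helper name `build_update_url` and the service and object names are hypothetical; the API version string is the one quoted in the article. It builds URLs and sends nothing.

```python
from urllib.parse import urlencode

# API version string quoted in the article text above.
PREVIEW_API_VERSION = "2020-06-30-Preview"


def build_update_url(service, collection, name, **flags):
    """Assemble a PUT URL for a skillset or data source update.

    Hypothetical helper: `service`, `collection`, and `name` are
    caller-supplied placeholders. Boolean flags become "true"/"false".
    """
    query = {"api-version": PREVIEW_API_VERSION}
    for key, value in flags.items():
        query[key] = str(value).lower() if isinstance(value, bool) else str(value)
    base = f"https://{service}.search.windows.net/{collection}/{name}"
    return f"{base}?{urlencode(query)}"


# Skillset edit that should not trigger reprocessing of cached output.
skillset_url = build_update_url(
    "my-service", "skillsets", "my-skillset",
    disableCacheReprocessingChangeDetection=True,
)

# Data source edit (for example, a rotated key) that should not reset the cache.
datasource_url = build_update_url(
    "my-service", "datasources", "my-datasource",
    ignoreResetRequirement=True,
)

print(skillset_url)
print(datasource_url)
```

Keeping the flags as explicit keyword arguments makes each opt-out visible at the call site, which fits the article's advice to use `ignoreResetRequirement` sparingly.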

articles/search/search-blob-storage-integration.md

Lines changed: 9 additions & 9 deletions
@@ -1,7 +1,7 @@
 ---
 title: Search over Azure Blob Storage content
 titleSuffix: Azure AI Search
-description: Learn about extracting text from Azure blobs and making it full-text searchable in an Azure AI Search index.
+description: Learn how to extract text from Azure blobs and making the content full-text searchable in an Azure AI Search index.

 manager: nitinme
 author: HeidiSteen
@@ -10,7 +10,7 @@ ms.service: cognitive-search
 ms.custom:
 - ignite-2023
 ms.topic: conceptual
-ms.date: 02/07/2023
+ms.date: 02/15/2024
 ---

 # Search over Azure Blob Storage content
@@ -24,11 +24,11 @@ In this article, review the basic workflow for extracting content and metadata f

 ## What it means to add full text search to blob data

-Azure AI Search is a standalone search service that supports indexing and query workloads over user-defined indexes that contain your remote searchable content hosted in the cloud. Co-locating your searchable content with the query engine is necessary for performance, returning results at a speed users have come to expect from search queries.
+Azure AI Search is a standalone search service that supports indexing and query workloads over user-defined indexes that contain your private searchable content hosted in the cloud. Co-locating your searchable content with the query engine in the cloud is necessary for performance, returning results at a speed users have come to expect from search queries.

 Azure AI Search integrates with Azure Blob Storage at the indexing layer, importing your blob content as search documents that are indexed into *inverted indexes* and other query structures that support free-form text queries and filter expressions. Because your blob content is indexed into a search index, you can use the full range of query features in Azure AI Search to find information in your blob content.

-Inputs are your blobs, in a single container, in Azure Blob Storage. Blobs can be almost any kind of text data. If your blobs contain images, you can add [AI enrichment](cognitive-search-concept-intro.md) to create and extract text from images.
+Inputs are your blobs, in a single container, in Azure Blob Storage. Blobs can be almost any kind of text data. If your blobs contain images, you can add [AI enrichment](cognitive-search-concept-intro.md) to create and extract text and features from images.

 Output is always an Azure AI Search index, used for fast text search, retrieval, and exploration in client applications. In between is the indexing pipeline architecture itself. The pipeline is based on the *indexer* feature, discussed further on in this article.

@@ -42,11 +42,11 @@ Within Blob Storage, you'll need a container that provides source content. You c

 You can start directly in your Storage Account portal page.

-1. In the left navigation page under **Data management**, select **Azure search** to select or create a search service.
+1. In the left navigation page under **Data management**, select **Azure AI Search** to select or create a search service.

-1. Follow the steps in the wizard to extract and optionally create searchable content from your blobs. The workflow is the [**Import data** wizard](cognitive-search-quickstart-blob.md).
+1. Follow the steps in the wizard to extract and optionally create searchable content from your blobs. The workflow is the [**Import data** wizard](cognitive-search-quickstart-blob.md). The workflow creates an indexer, data source, index, and option skillset on your Azure AI Search service.

-:::image type="content" source="media/search-blob-storage-integration/blob-blade.png" alt-text="Screenshot of the Azure search wizard in the Azure Storage portal page." border="true":::
+:::image type="content" source="media/search-blob-storage-integration/blob-blade.png" alt-text="Screenshot of the Azure AI Search wizard in the Azure Storage portal page." border="true":::

 1. Use [Search explorer](search-explorer.md) in the search portal page to query your content.

@@ -70,7 +70,7 @@ Textual content of a document is extracted into a string field named "content".

 An *indexer* is a data-source-aware subservice in Azure AI Search, equipped with internal logic for sampling data, reading and retrieving data and metadata, and serializing data from native formats into JSON documents for subsequent import.

-Blobs in Azure Storage are indexed using the [blob indexer](search-howto-indexing-azure-blob-storage.md). You can invoke this indexer by using the **Azure search** command in Azure Storage, the **Import data** wizard, a REST API, or the .NET SDK. In code, you use this indexer by setting the type, and by providing connection information that includes an Azure Storage account along with a blob container. You can subset your blobs by creating a virtual directory, which you can then pass as a parameter, or by filtering on a file type extension.
+Blobs in Azure Storage are indexed using the [blob indexer](search-howto-indexing-azure-blob-storage.md). You can invoke this indexer by using the **Azure AI Search** command in Azure Storage, the **Import data** wizard, a REST API, or the .NET SDK. In code, you use this indexer by setting the type, and by providing connection information that includes an Azure Storage account along with a blob container. You can subset your blobs by creating a virtual directory, which you can then pass as a parameter, or by filtering on a file type extension.

 An indexer ["cracks a document"](search-indexer-overview.md#document-cracking), opening a blob to inspect content. After connecting to the data source, it's the first step in the pipeline. For blob data, this is where PDF, Office docs, and other content types are detected. Document cracking with text extraction is no charge. If your blobs contain image content, images are ignored unless you [add AI enrichment](cognitive-search-concept-intro.md). Standard indexing applies only to text content.

@@ -95,7 +95,7 @@ You can control which blobs are indexed, and which are skipped, by the blob's fi
 Include specific file extensions by setting `"indexedFileNameExtensions"` to a comma-separated list of file extensions (with a leading dot). Exclude specific file extensions by setting `"excludedFileNameExtensions"` to the extensions that should be skipped. If the same extension is in both lists, it will be excluded from indexing.

 ```http
-PUT /indexers/[indexer name]?api-version=2020-06-30
+PUT /indexers/[indexer name]?api-version=2023-11-01
 {
 "parameters" : {
 "configuration" : {
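
Reviewer's sketch (not part of the commit): the hunk above is cut off before the configuration values, so the Python below only illustrates the rule stated in the context line, namely that both settings are comma-separated lists with leading dots and that an extension present in both lists is excluded. The function names are hypothetical. `is_indexed` is a local model of the documented rule, not the service's implementation, and it assumes an empty include list means "no include filter".

```python
import json
import os


def extension_filter_config(included=(), excluded=()):
    """Build a `parameters.configuration` fragment for extension filtering."""
    configuration = {}
    if included:
        configuration["indexedFileNameExtensions"] = ",".join(included)
    if excluded:
        configuration["excludedFileNameExtensions"] = ",".join(excluded)
    return {"parameters": {"configuration": configuration}}


def is_indexed(blob_name, included=(), excluded=()):
    """Local model of the stated rule: exclusion wins over inclusion."""
    extension = os.path.splitext(blob_name)[1]
    if extension in excluded:
        return False
    # Assumption: with no include list, any non-excluded extension passes.
    return not included or extension in included


included = [".pdf", ".docx"]
excluded = [".docx", ".png"]

body = extension_filter_config(included, excluded)
print(json.dumps(body, indent=2))

for blob in ("report.pdf", "notes.docx", "diagram.png"):
    print(blob, is_indexed(blob, included, excluded))
```

Here `.docx` appears in both lists, so the model treats `notes.docx` as skipped, matching the sentence about conflicting entries.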

articles/search/search-more-like-this.md

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@ ms.author: beloh
 ms.service: cognitive-search
 ms.custom:
 - ignite-2023
-ms.topic: conceptual
-ms.date: 10/06/2022
+ms.topic: reference
+ms.date: 02/16/2024
 ---

 # moreLikeThis (preview) in Azure AI Search
