You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/search/cognitive-search-incremental-indexing-conceptual.md
+20-20Lines changed: 20 additions & 20 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -8,27 +8,32 @@ ms.service: cognitive-search
8
8
ms.custom:
9
9
- ignite-2023
10
10
ms.topic: conceptual
11
-
ms.date: 04/21/2023
11
+
ms.date: 02/16/2024
12
12
---
13
13
14
14
# Incremental enrichment and caching in Azure AI Search
15
15
16
16
> [!IMPORTANT]
17
17
> This feature is in public preview under [supplemental terms of use](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). The [preview REST API](/rest/api/searchservice/index-preview) supports this feature.
18
18
19
-
*Incremental enrichment* refers to the use of cached enrichments during [skillset execution](cognitive-search-working-with-skillsets.md) so that only new and changed skills and documents incur AI processing. The cache contains the output from [document cracking](search-indexer-overview.md#document-cracking), plus the outputs of each skill for every document. Although caching is billable (it uses Azure Storage), the overall cost of enrichment is reduced because the costs of storage are less than image extraction and AI processing.
19
+
*Incremental enrichment* refers to the use of cached enrichments during [skillset execution](cognitive-search-working-with-skillsets.md) so that only new and changed skills and documents incur AI processing charges. The cache contains the output from [document cracking](search-indexer-overview.md#document-cracking), plus the outputs of each skill for every document. Although caching is billable (it uses Azure Storage), the overall cost of enrichment is reduced because the costs of storage are less than image extraction and AI processing.
20
20
21
21
When you enable caching, the indexer evaluates your updates to determine whether existing enrichments can be pulled from the cache. Image and text content from the document cracking phase, plus skill outputs that are upstream or orthogonal to your edits, are likely to be reusable.
22
22
23
-
After performing the incremental enrichments as indicated by the skillset update, refreshed results are written back to the cache, and also to the search index or knowledge store.
23
+
After skillset processing is finished, the refreshed results are written back to the cache, and also to the search index or knowledge store.
24
+
25
+
## Limitations
26
+
27
+
> [!CAUTION]
28
+
> If you're using the [SharePoint Online indexer (Preview)](search-howto-index-sharepoint-online.md), you should avoid incremental enrichment. Under certain circumstances, the cache becomes invalid, requiring an [indexer reset and run](search-howto-run-reset-indexers.md), should you choose to reload it.
24
29
25
30
## Cache configuration
26
31
27
-
Physically, the cache is stored in a blob container in your Azure Storage account, one per indexer. Each indexer is assigned a unique and immutable cache identifier that corresponds to the container it is using.
32
+
Physically, the cache is stored in a blob container in your Azure Storage account, one per indexer. Each indexer is assigned a unique and immutable cache identifier that corresponds to the container it's using.
28
33
29
-
The cache is created when you specify the "cache" property and run the indexer. Only enriched content can be cached. If your indexer does not have an attached skillset, then caching does not apply.
34
+
The cache is created when you specify the "cache" property and run the indexer. Only enriched content can be cached. If your indexer doesn't have an attached skillset, then caching doesn't apply.
30
35
31
-
The following example illustrates an indexer with caching enabled. See [Enable enrichment caching](search-howto-incremental-index.md) for full instructions. Notice that when adding the cache property, use preview API version, 2020-06-30-Preview or later, on the request.
36
+
The following example illustrates an indexer with caching enabled. See [Enable enrichment caching](search-howto-incremental-index.md) for full instructions. Notice that when adding the cache property, use a [preview API version](/rest/api/searchservice/search-service-api-versions#preview-versions), 2020-06-30-Preview or later, on the request.
32
37
33
38
```json
34
39
POST https://[search service name].search.windows.net/indexers?api-version=2020-06-30-Preview
@@ -49,7 +54,7 @@ POST https://[search service name].search.windows.net/indexers?api-version=2020-
49
54
50
55
## Cache management
51
56
52
-
The lifecycle of the cache is managed by the indexer. If an indexer is deleted, its cache is also deleted. If the "cache" property on the indexer is set to null or the connection string is changed, the existing cache is deleted on the next indexer run.
57
+
The lifecycle of the cache is managed by the indexer. If an indexer is deleted, its cache is also deleted. If the `cache` property on the indexer is set to null or the connection string is changed, the existing cache is deleted on the next indexer run.
53
58
54
59
While incremental enrichment is designed to detect and respond to changes with no intervention on your part, there are parameters you can use to invoke specific behaviors:
55
60
@@ -62,17 +67,17 @@ While incremental enrichment is designed to detect and respond to changes with n
62
67
63
68
### Prioritize new documents
64
69
65
-
The "cache" property includes an "enableReprocessing" parameter. It's used to control processing over incoming documents already represented in the cache. When true (default), documents already in the cache are reprocessed when you rerun the indexer, assuming your skill update affects that doc.
70
+
The cache property includes an `enableReprocessing` parameter. It's used to control processing over incoming documents already represented in the cache. When true (default), documents already in the cache are reprocessed when you rerun the indexer, assuming your skill update affects that doc.
66
71
67
-
When false, existing documents are not reprocessed, effectively prioritizing new, incoming content over existing content. You should only set "enableReprocessing" to false on a temporary basis. Having "enableReprocessing" set to true most of the time ensures that all documents, both new and existing, are valid per the current skillset definition.
72
+
When false, existing documents aren't reprocessed, effectively prioritizing new, incoming content over existing content. You should only set enableReprocessing to false on a temporary basis. Having enableReprocessing set to true most of the time ensures that all documents, both new and existing, are valid per the current skillset definition.
68
73
69
74
<aname="Bypass-skillset-checks"></a>
70
75
71
76
### Bypass skillset evaluation
72
77
73
-
Modifying a skill and reprocessing of that skill typically go hand in hand. However, some changes to a skill should not result in reprocessing (for example, deploying a custom skill to a new location or with a new access key). Most likely, these are peripheral modifications that have no genuine impact on the substance of the skill output itself.
78
+
Modifying a skill and reprocessing of that skill typically go hand in hand. However, some changes to a skill shouldn't result in reprocessing (for example, deploying a custom skill to a new location or with a new access key). Most likely, these are peripheral modifications that have no genuine impact on the substance of the skill output itself.
74
79
75
-
If you know that a change to the skill is indeed superficial, you should override skill evaluation by setting the "disableCacheReprocessingChangeDetection" parameter to true:
80
+
If you know that a change to the skill is indeed superficial, you should override skill evaluation by setting the `disableCacheReprocessingChangeDetection` parameter to true:
76
81
77
82
1. Call [Update Skillset](/rest/api/searchservice/update-skillset) and modify the skillset definition.
78
83
1. Append the "disableCacheReprocessingChangeDetection=true" parameter on the request.
@@ -89,7 +94,7 @@ PUT https://[servicename].search.windows.net/skillsets/[skillset name]?api-versi
89
94
90
95
### Bypass data source validation checks
91
96
92
-
Most changes to a data source definition will invalidate the cache. However, for scenarios where you know that a change should not invalidate the cache - such as changing a connection string or rotating the key on the storage account - append the "ignoreResetRequirement" parameter on the [data source update](/rest/api/searchservice/update-data-source). Setting this parameter to true allows the commit to go through, without triggering a reset condition that would result in all objects being rebuilt and populated from scratch.
97
+
Most changes to a data source definition will invalidate the cache. However, for scenarios where you know that a change shouldn't invalidate the cache - such as changing a connection string or rotating the key on the storage account - append the `ignoreResetRequirement` parameter on the [data source update](/rest/api/searchservice/update-data-source). Setting this parameter to true allows the commit to go through, without triggering a reset condition that would result in all objects being rebuilt and populated from scratch.
93
98
94
99
```http
95
100
PUT https://[search service].search.windows.net/datasources/[data source name]?api-version=2020-06-30-Preview&ignoreResetRequirement
@@ -135,7 +140,7 @@ POST https://[search service name].search.windows.net/indexers/[indexer name]/re
135
140
136
141
Once you enable a cache, the indexer evaluates changes in your pipeline composition to determine which content can be reused and which needs reprocessing. This section enumerates changes that invalidate the cache outright, followed by changes that trigger incremental processing.
137
142
138
-
An invalidating change is one where the entire cache is no longer valid. An example of an invalidating change is one where your data source is updated. Here is the complete list of changes to any part of the indexer pipeline that would invalidate your cache:
143
+
An invalidating change is one where the entire cache is no longer valid. An example of an invalidating change is one where your data source is updated. Here's the complete list of changes to any part of the indexer pipeline that would invalidate your cache:
139
144
140
145
+ Changing the data source type
141
146
+ Changing data source container
@@ -155,7 +160,7 @@ An invalidating change is one where the entire cache is no longer valid. An exam
155
160
156
161
## Changes that trigger incremental processing
157
162
158
-
Incremental processing evaluates your skillset definition and determines which skills to rerun, selectively updating the affected portions of the document tree. Here is the complete list of changes resulting in incremental enrichment:
163
+
Incremental processing evaluates your skillset definition and determines which skills to rerun, selectively updating the affected portions of the document tree. Here's the complete list of changes resulting in incremental enrichment:
159
164
160
165
+ Changing the skill type (the OData type of the skill is updated)
161
166
+ Skill-specific parameters are updated, for example a URL, defaults, or other parameters
@@ -176,12 +181,7 @@ REST API version `2020-06-30-Preview` or later provides incremental enrichment t
+[Update Data Source](/rest/api/searchservice/update-data-source), when called with a preview API version, provides a new parameter named "ignoreResetRequirement", which should be set to true when your update action should not invalidate the cache. Use "ignoreResetRequirement" sparingly as it could lead to unintended inconsistency in your data that will not be detected easily.
180
-
181
-
## Limitations
182
-
183
-
> [!CAUTION]
184
-
> If you're using the [SharePoint Online indexer (Preview)](search-howto-index-sharepoint-online.md), you should avoid incremental enrichment. Under certain circumstances, the cache becomes invalid, requiring an [indexer reset and run](search-howto-run-reset-indexers.md), should you choose to reload it.
184
+
+[Update Data Source](/rest/api/searchservice/update-data-source), when called with a preview API version, provides a new parameter named "ignoreResetRequirement", which should be set to true when your update action shouldn't invalidate the cache. Use "ignoreResetRequirement" sparingly as it could lead to unintended inconsistency in your data that won't be detected easily.
Copy file name to clipboardExpand all lines: articles/search/search-blob-storage-integration.md
+9-9Lines changed: 9 additions & 9 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,7 +1,7 @@
1
1
---
2
2
title: Search over Azure Blob Storage content
3
3
titleSuffix: Azure AI Search
4
-
description: Learn about extracting text from Azure blobs and making it full-text searchable in an Azure AI Search index.
4
+
description: Learn how to extract text from Azure blobs and making the content full-text searchable in an Azure AI Search index.
5
5
6
6
manager: nitinme
7
7
author: HeidiSteen
@@ -10,7 +10,7 @@ ms.service: cognitive-search
10
10
ms.custom:
11
11
- ignite-2023
12
12
ms.topic: conceptual
13
-
ms.date: 02/07/2023
13
+
ms.date: 02/15/2024
14
14
---
15
15
16
16
# Search over Azure Blob Storage content
@@ -24,11 +24,11 @@ In this article, review the basic workflow for extracting content and metadata f
24
24
25
25
## What it means to add full text search to blob data
26
26
27
-
Azure AI Search is a standalone search service that supports indexing and query workloads over user-defined indexes that contain your remote searchable content hosted in the cloud. Co-locating your searchable content with the query engine is necessary for performance, returning results at a speed users have come to expect from search queries.
27
+
Azure AI Search is a standalone search service that supports indexing and query workloads over user-defined indexes that contain your private searchable content hosted in the cloud. Co-locating your searchable content with the query engine in the cloud is necessary for performance, returning results at a speed users have come to expect from search queries.
28
28
29
29
Azure AI Search integrates with Azure Blob Storage at the indexing layer, importing your blob content as search documents that are indexed into *inverted indexes* and other query structures that support free-form text queries and filter expressions. Because your blob content is indexed into a search index, you can use the full range of query features in Azure AI Search to find information in your blob content.
30
30
31
-
Inputs are your blobs, in a single container, in Azure Blob Storage. Blobs can be almost any kind of text data. If your blobs contain images, you can add [AI enrichment](cognitive-search-concept-intro.md) to create and extract text from images.
31
+
Inputs are your blobs, in a single container, in Azure Blob Storage. Blobs can be almost any kind of text data. If your blobs contain images, you can add [AI enrichment](cognitive-search-concept-intro.md) to create and extract text and features from images.
32
32
33
33
Output is always an Azure AI Search index, used for fast text search, retrieval, and exploration in client applications. In between is the indexing pipeline architecture itself. The pipeline is based on the *indexer* feature, discussed further on in this article.
34
34
@@ -42,11 +42,11 @@ Within Blob Storage, you'll need a container that provides source content. You c
42
42
43
43
You can start directly in your Storage Account portal page.
44
44
45
-
1. In the left navigation page under **Data management**, select **Azure search** to select or create a search service.
45
+
1. In the left navigation page under **Data management**, select **Azure AI Search** to select or create a search service.
46
46
47
-
1. Follow the steps in the wizard to extract and optionally create searchable content from your blobs. The workflow is the [**Import data** wizard](cognitive-search-quickstart-blob.md).
47
+
1. Follow the steps in the wizard to extract and optionally create searchable content from your blobs. The workflow is the [**Import data** wizard](cognitive-search-quickstart-blob.md). The workflow creates an indexer, data source, index, and option skillset on your Azure AI Search service.
48
48
49
-
:::image type="content" source="media/search-blob-storage-integration/blob-blade.png" alt-text="Screenshot of the Azure search wizard in the Azure Storage portal page." border="true":::
49
+
:::image type="content" source="media/search-blob-storage-integration/blob-blade.png" alt-text="Screenshot of the Azure AI Search wizard in the Azure Storage portal page." border="true":::
50
50
51
51
1. Use [Search explorer](search-explorer.md) in the search portal page to query your content.
52
52
@@ -70,7 +70,7 @@ Textual content of a document is extracted into a string field named "content".
70
70
71
71
An *indexer* is a data-source-aware subservice in Azure AI Search, equipped with internal logic for sampling data, reading and retrieving data and metadata, and serializing data from native formats into JSON documents for subsequent import.
72
72
73
-
Blobs in Azure Storage are indexed using the [blob indexer](search-howto-indexing-azure-blob-storage.md). You can invoke this indexer by using the **Azure search** command in Azure Storage, the **Import data** wizard, a REST API, or the .NET SDK. In code, you use this indexer by setting the type, and by providing connection information that includes an Azure Storage account along with a blob container. You can subset your blobs by creating a virtual directory, which you can then pass as a parameter, or by filtering on a file type extension.
73
+
Blobs in Azure Storage are indexed using the [blob indexer](search-howto-indexing-azure-blob-storage.md). You can invoke this indexer by using the **Azure AI Search** command in Azure Storage, the **Import data** wizard, a REST API, or the .NET SDK. In code, you use this indexer by setting the type, and by providing connection information that includes an Azure Storage account along with a blob container. You can subset your blobs by creating a virtual directory, which you can then pass as a parameter, or by filtering on a file type extension.
74
74
75
75
An indexer ["cracks a document"](search-indexer-overview.md#document-cracking), opening a blob to inspect content. After connecting to the data source, it's the first step in the pipeline. For blob data, this is where PDF, Office docs, and other content types are detected. Document cracking with text extraction is no charge. If your blobs contain image content, images are ignored unless you [add AI enrichment](cognitive-search-concept-intro.md). Standard indexing applies only to text content.
76
76
@@ -95,7 +95,7 @@ You can control which blobs are indexed, and which are skipped, by the blob's fi
95
95
Include specific file extensions by setting `"indexedFileNameExtensions"` to a comma-separated list of file extensions (with a leading dot). Exclude specific file extensions by setting `"excludedFileNameExtensions"` to the extensions that should be skipped. If the same extension is in both lists, it will be excluded from indexing.
96
96
97
97
```http
98
-
PUT /indexers/[indexer name]?api-version=2020-06-30
98
+
PUT /indexers/[indexer name]?api-version=2023-11-01
0 commit comments