articles/search/search-howto-create-indexers.md
Lines changed: 18 additions & 14 deletions
Original file line number
Diff line number
Diff line change
@@ -15,19 +15,21 @@ ms.date: 05/11/2022
15
15
16
16
# Creating indexers in Azure Cognitive Search
17
17
18
-
A search indexer connects to an external data source, retrieves and processes data, and then passes it to the search engine for indexing. Indexers support two workflows:
18
+
An indexer is a named object on a search service that automates an indexing workload by connecting to an external data source, retrieving and processing data, and then passing the data on to the search engine for indexing. Using indexers significantly reduces the quantity and complexity of the code you need to write.
19
+
20
+
Indexers support two workflows:
19
21
20
22
+ Text-based indexing, extracting strings and metadata for full text search scenarios.
21
23
22
-
+[AI-enriched indexing](cognitive-search-concept-intro.md), applying integrated machine learning and AI models to analyze content that isn't otherwise searchable, such as images and large undifferentiated text.
24
+
+ Skills-based indexing, using built-in or custom skills to apply integrated machine learning and AI models that analyze content for text and structure. Skills-based indexing enables search over content that isn't otherwise easily searchable, such as images and large undifferentiated text. To learn about skills-based indexing, see [AI enrichment in Cognitive Search](cognitive-search-concept-intro.md).
23
25
24
-
Using indexers significantly reduces the quantity and complexity of the code you need to write. This article focuses on the basics of creating an indexer. Depending on the data source and your workflow, more configuration might be necessary.
26
+
This article focuses on the basic steps of creating an indexer. Depending on the data source and your workflow, more configuration might be necessary.
25
27
26
28
## Indexer definitions
27
29
28
-
When you create an indexer, the definition will adhere to one of two patterns: text-based indexing or AI enrichment with skills. The only difference is that an indexer that invokes AI enrichment has more definitions.
30
+
When you create an indexer, the definition will adhere to one of two patterns: text-based indexing or AI enrichment with skills. The patterns are the same except that skills-based indexing has more definitions.
29
31
30
-
### Indexer definition for full text search
32
+
### Indexer definition for text-based indexing
31
33
32
34
Full text search is the primary use case for indexers, and for this workflow, an indexer will look like this example.
33
35
@@ -57,19 +59,21 @@ Indexers have the following requirements:
57
59
+ A "dataSourceName" property that points to a data source object. It specifies a connection to external data.
58
60
+ A "targetIndexName" property that points to the destination search index.
59
61
60
-
Parameters are optional and modify run time behaviors, such as how many errors to accept before failing the entire job. The parameters above are available for all indexers and are documented in the [REST API reference](/rest/api/searchservice/create-indexer#request-body).
62
+
Other parameters are optional and modify runtime behaviors, such as how many errors to accept before failing the entire job. The parameters above are available for all indexers and are documented in the [REST API reference](/rest/api/searchservice/create-indexer#request-body).
61
63
62
-
Source-specific indexers for blobs, SQL, and Azure Cosmos DB provide extra "configuration" parameters for source-specific behaviors. For example, if the source is Blob Storage, you can set a parameter that filters on file extensions: `"parameters" : { "configuration" : { "indexedFileNameExtensions" : ".pdf,.docx" } }`.
64
+
Data source-specific indexers for blobs, SQL, and Azure Cosmos DB provide extra "configuration" parameters for source-specific behaviors. For example, if the source is Blob Storage, you can set a parameter that filters on file extensions: `"parameters" : { "configuration" : { "indexedFileNameExtensions" : ".pdf,.docx" } }`. If the source is Azure SQL, you can set a query timeout parameter.
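As a rough illustration of the required properties and the optional "configuration" block described above, the following Python sketch assembles indexer definitions as plain dictionaries and prints one as a JSON request body. The helper name `build_indexer` and the object names (`blob-indexer`, `my-blob-datasource`, and so on) are hypothetical. The `maxFailedItems` and `queryTimeout` parameter names and the timeout value format are given from memory of the REST API and should be checked against the REST API reference before use. No request is sent.

```python
import json


def build_indexer(name, data_source, target_index, configuration=None, max_failed_items=0):
    """Assemble an indexer definition as a plain dict (hypothetical helper)."""
    definition = {
        "name": name,
        "dataSourceName": data_source,    # required: points to a data source object
        "targetIndexName": target_index,  # required: destination search index
        # Optional runtime behavior: how many failed items to tolerate.
        "parameters": {"maxFailedItems": max_failed_items},
    }
    if configuration:
        # Source-specific settings go under parameters.configuration.
        definition["parameters"]["configuration"] = dict(configuration)
    return definition


# Blob Storage: filter on file extensions (value taken from the text above).
blob_indexer = build_indexer(
    "blob-indexer", "my-blob-datasource", "my-index",
    configuration={"indexedFileNameExtensions": ".pdf,.docx"},
)

# Azure SQL: query timeout (parameter name and format are assumptions to verify).
sql_indexer = build_indexer(
    "sql-indexer", "my-sql-datasource", "my-index",
    configuration={"queryTimeout": "00:10:00"},
)

print(json.dumps(blob_indexer, indent=2))
```

Keeping the source-specific settings in a separate `configuration` argument mirrors the way the definition nests them under "parameters".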
63
65
64
-
[Field mappings](search-indexer-field-mappings.md) are used to explicitly map source-to-destination fields if those fields differ by name or type.
66
+
[Field mappings](search-indexer-field-mappings.md) are used to explicitly map source-to-destination fields if there are discrepancies by name or type between a field in the data source and a field in the search index.
65
67
66
-
An indexer will run immediately when you create it on the search service. If you don't want indexer execution, set "disabled" to true.
68
+
By default, an indexer runs immediately when you create it on the search service. If you don't want the indexer to run on creation, set "disabled" to true when creating the indexer.
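A minimal sketch of a definition that is created without running, assuming hypothetical object names. It only builds the JSON body; sending it to the service is out of scope here. The `enable` helper is also hypothetical and assumes that updating the definition with "disabled" set to false is how the indexer is later switched on.

```python
import json

# Hypothetical names; only "disabled" is the point of this example.
indexer = {
    "name": "my-indexer",
    "dataSourceName": "my-datasource",
    "targetIndexName": "my-index",
    "disabled": True,  # create the indexer without running it
}

# json.dumps renders Python's True as the JSON literal true.
request_body = json.dumps(indexer, indent=2)
print(request_body)


def enable(definition):
    """Return a copy of the definition with execution switched back on."""
    updated = dict(definition)
    updated["disabled"] = False
    return updated
```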
67
69
68
70
You can also [specify a schedule](search-howto-schedule-indexers.md) or set an [encryption key](search-security-manage-encryption-keys.md) for supplemental encryption of the indexer definition.
69
71
70
-
### Indexer definition for AI enrichment
72
+
### Indexer definition for skills-based indexing and AI enrichment
73
+
74
+
Indexers also drive [AI enrichment](cognitive-search-concept-intro.md). All of the above properties and parameters apply, but the following extra properties are specific to AI enrichment: **`skillSetName`**, **`outputFieldMappings`**, **`cache`**.
71
75
72
-
Indexers also drive [AI enrichment](cognitive-search-concept-intro.md). All of the above properties and parameters apply, but the following properties are specific to AI enrichment: **`skillSetName`**, **`outputFieldMappings`**, **`cache`**. A [skillset](cognitive-search-defining-skillset.md) also has **`cognitiveServices`**, and **`knowledgeStore`**. A few other required and similarly named properties are added for context.
76
+
A [skillset](cognitive-search-defining-skillset.md) also has **`cognitiveServices`** and **`knowledgeStore`**. A few other required and similarly named properties are added for context.
73
77
74
78
```json
75
79
{
@@ -226,11 +230,11 @@ If your data source supports change detection, an indexer can detect underlying
226
230
227
231
Change detection logic is built into the data platforms. How an indexer supports change detection varies by data source:
228
232
229
-
+ Azure Storage has built-in change detection, which means an indexer can recognize new and updated documents automatically. Blob Storage, Azure Table Storage, and Azure Data Lake Storage Gen2 stamp each blob or row update with a date and time. An indexer can use this information to determine which documents to update in the index.
233
+
+ Azure Storage has built-in change detection, which means an indexer can recognize new and updated documents automatically. Blob Storage, Azure Table Storage, and Azure Data Lake Storage Gen2 stamp each blob or row update with a date and time. An indexer automatically uses this information to determine which documents to update in the index.
230
234
231
-
+ Azure SQL and Azure Cosmos DB provide optional change detection features in their platforms. You can specify the change detection policy in your data source definition.
235
+
+ Azure SQL and Azure Cosmos DB provide optional change detection features in their platforms. For these data sources, change detection isn't automatic. You'll need to specify in the data source definition which change detection policy is used.
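To illustrate what specifying a change detection policy in a data source definition might look like, here is a hedged Python sketch that attaches a policy to a data source dictionary. The helper `with_change_detection` and all object names are hypothetical. The policy type strings and the `_ts` column for Azure Cosmos DB are recalled from the service documentation and should be verified against the data source REST reference. Connection strings are placeholders.

```python
import json

# Policy type names as recalled from the REST API; verify before relying on them.
HIGH_WATER_MARK = "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy"
SQL_INTEGRATED = "#Microsoft.Azure.Search.SqlIntegratedChangeTrackingPolicy"


def with_change_detection(data_source, odata_type, column=None):
    """Return a copy of a data source definition with a change detection policy."""
    policy = {"@odata.type": odata_type}
    if column is not None:
        # High water mark policies name the column that increases on every change.
        policy["highWaterMarkColumnName"] = column
    updated = dict(data_source)
    updated["dataChangeDetectionPolicy"] = policy
    return updated


cosmos_source = with_change_detection(
    {
        "name": "my-cosmosdb-datasource",
        "type": "cosmosdb",
        "credentials": {"connectionString": "<connection string>"},
        "container": {"name": "my-collection"},
    },
    HIGH_WATER_MARK,
    column="_ts",
)

sql_source = with_change_detection(
    {
        "name": "my-sql-datasource",
        "type": "azuresql",
        "credentials": {"connectionString": "<connection string>"},
        "container": {"name": "my-table"},
    },
    SQL_INTEGRATED,
)

print(json.dumps(cosmos_source, indent=2))
```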
232
236
233
-
For large indexing loads, an indexer also keeps track of the last document it processed through an internal "high water mark". The marker is never exposed in the API, but internally the indexer keeps track of where it stopped. When indexing resumes, either through a scheduled run or an on-demand invocation, the indexer references the high water mark so that it can pick up where it left off.
237
+
An indexer keeps track of the last document it processed from the data source through an internal "high water mark". The marker is never exposed in the API, but internally the indexer keeps track of where it stopped. When indexing resumes, either through a scheduled run or an on-demand invocation, the indexer references the high water mark so that it can pick up where it left off.
234
238
235
239
If you need to clear the high water mark to reindex in full, you can use [Reset Indexer](/rest/api/searchservice/reset-indexer). For more selective reindexing, use [Reset Skills](/rest/api/searchservice/preview-api/reset-skills) or [Reset Documents](/rest/api/searchservice/preview-api/reset-documents). Through the reset APIs, you can clear internal state, and also flush the cache if you enabled [incremental enrichment](search-howto-incremental-index.md). For more background and comparison of each reset option, see [Run or reset indexers, skills, and documents](search-howto-run-reset-indexers.md).
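The high water mark is internal to the service, so the following is only a conceptual toy model, not the service's implementation. It assumes each source document carries a comparable "last modified" value, and it shows why a run picks up only new or changed documents while a reset followed by a run reprocesses everything. The class `ToyIndexer` and its methods are invented for illustration.

```python
class ToyIndexer:
    """Conceptual model of high water mark tracking (not the real service logic)."""

    def __init__(self):
        self.high_water_mark = None  # nothing processed yet
        self.index = {}              # key -> indexed content

    def run(self, source):
        """Index documents modified after the high water mark.

        `source` maps a document key to a (modified, content) tuple.
        Returns the keys processed in this run, oldest change first.
        """
        processed = []
        for key, (modified, content) in sorted(source.items(), key=lambda kv: kv[1][0]):
            if self.high_water_mark is None or modified > self.high_water_mark:
                self.index[key] = content
                processed.append(key)
        if processed:
            # Advance the marker to the newest change seen in this run.
            self.high_water_mark = max(source[key][0] for key in processed)
        return processed

    def reset(self):
        """Clear the marker so the next run reprocesses every document."""
        self.high_water_mark = None


source = {"a": (1, "first"), "b": (2, "second")}
indexer = ToyIndexer()
first_run = indexer.run(source)      # both documents are new
source["a"] = (3, "first, edited")   # one document changes
second_run = indexer.run(source)     # only the changed document
third_run = indexer.run(source)      # nothing changed, nothing to do
indexer.reset()
after_reset = indexer.run(source)    # full reprocessing
```

In this model, deleting a key from `source` never removes it from `index`, which parallels the fact that reset and run alone don't remove orphaned search documents.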
articles/search/search-howto-run-reset-indexers.md
Lines changed: 17 additions & 11 deletions
@@ -13,19 +13,19 @@ ms.date: 01/07/2022
13
13
14
14
# Run or reset indexers, skills, or documents
15
15
16
-
Indexers can be invoked in three ways: on demand, on a schedule, or when the [indexer is created](/rest/api/searchservice/create-indexer), assuming it's not created in "disabled" mode. This article explains how to run indexers on demand, with and without a reset.
16
+
In Azure Cognitive Search, there are several ways to run an indexer:
17
17
18
-
## Resetting indexers
18
+
+ [Run when creating or updating an indexer](search-howto-create-indexers.md), assuming it's not created in "disabled" mode.
19
+
+ [Run on a schedule](search-howto-schedule-indexers.md) to invoke execution at regular intervals.
20
+
+ Run on demand, with or without a "reset".
19
21
20
-
After the initial run, an indexer keeps track of which search documents have been indexed through an internal *high-water mark*. The marker is never exposed, but internally the indexer knows where it last stopped, so that it can pick up where it left off on the next run.
22
+
This article explains how to run indexers on demand, with and without a reset.
21
23
22
-
If you need to rebuild all or part of an index, you can clear the indexer's high-water mark through a reset. Reset APIs are available at decreasing levels in the object hierarchy:
24
+
## Run without reset
23
25
24
-
+[Reset Indexers](#reset-indexers) clears the high-water mark and performs a full reindex of all documents
25
-
+[Reset Documents (preview)](#reset-docs) reindexes a specific document or list of documents
26
-
+[Reset Skills (preview)](#reset-skills) invokes skill processing for a specific skill
26
+
[Run Indexer](/rest/api/searchservice/run-indexer) will detect and process only what is necessary to synchronize the search index with changes in the underlying data source. Incremental indexing starts by locating an internal high-water mark to find the last updated search document, which becomes the starting point for indexer execution over new and updated documents in the data source.
27
27
28
-
After reset, follow with a Run command to reprocess new and existing documents. Orphaned search documents having no counterpart in the data source cannot be removed through reset/run. If you need to delete documents, see [Add, Update or Delete Documents](/rest/api/searchservice/addupdate-or-delete-documents) instead.
28
+
Change detection is essential for determining what's new or updated in the data source. If the content is unchanged, Run has no effect. Blob storage has built-in change detection through its LastModified property. Other data sources, such as Azure SQL or Azure Cosmos DB, have to be configured for change detection before the indexer can read new and updated rows.
29
29
30
30
## Indexer execution
31
31
@@ -49,11 +49,17 @@ Indexer limits vary by the workload. For each workload, the following job limits
49
49
> [!TIP]
50
50
> If you are [indexing a large data set](search-howto-large-index.md), you can stretch processing out by putting the indexer [on a schedule](search-howto-schedule-indexers.md). For the full list of all indexer-related limits, see [indexer limits](search-limits-quotas-capacity.md#indexer-limits).
51
51
52
-
## Run without reset
52
+
## Resetting indexers
53
53
54
-
[Run Indexer](/rest/api/searchservice/run-indexer) will detect and process only what it necessary to synchronize the search index with changes in the underlying data source. Incremental indexing starts by locating an internal high-water mark to find the last updated search document, which becomes the starting point for indexer execution over new and updated documents in the data source.
54
+
After the initial run, an indexer keeps track of which search documents have been indexed through an internal *high-water mark*. The marker is never exposed, but internally the indexer knows where it last stopped, so that it can pick up where it left off on the next run.
55
55
56
-
Change detection is essential for determining what's new or updated in the data source. If the content is unchanged, Run has no effect. Blob storage has built-in change detection through its LastModified property. Other data sources, such as Azure SQL or Azure Cosmos DB, have to be configured for change detection before the indexer can read new and updated rows.
56
+
If you need to rebuild all or part of an index, you can clear the indexer's high-water mark through a reset. Reset APIs are available at decreasing levels in the object hierarchy:
57
+
58
+
+ [Reset Indexers](#reset-indexers) clears the high-water mark and performs a full reindex of all documents
59
+
+ [Reset Documents (preview)](#reset-docs) reindexes a specific document or list of documents
60
+
+ [Reset Skills (preview)](#reset-skills) invokes skill processing for a specific skill
61
+
62
+
After reset, follow with a Run command to reprocess new and existing documents. Orphaned search documents having no counterpart in the data source cannot be removed through reset/run. If you need to delete documents, see [Add, Update or Delete Documents](/rest/api/searchservice/addupdate-or-delete-documents) instead.
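The reset-then-run sequence can be sketched as two REST calls. The Python below only builds the request URLs and doesn't send anything. The URL shape follows the Reset Indexer and Run Indexer REST references as best recalled, and the default `api-version` value, the service name, and the helper names are assumptions. A real call would also need an admin API key in the request headers.

```python
from urllib.parse import quote


def indexer_operation_url(service_name, indexer_name, operation, api_version="2020-06-30"):
    """Build the URL for a reset or run call (hypothetical helper; sends nothing)."""
    if operation not in {"reset", "run"}:
        raise ValueError(f"unsupported operation: {operation}")
    return (
        f"https://{service_name}.search.windows.net"
        f"/indexers/{quote(indexer_name, safe='')}/{operation}"
        f"?api-version={api_version}"
    )


def full_reindex_plan(service_name, indexer_name):
    """Return the ordered requests for a full reindex: reset first, then run."""
    return [
        ("POST", indexer_operation_url(service_name, indexer_name, "reset")),
        ("POST", indexer_operation_url(service_name, indexer_name, "run")),
    ]


for method, url in full_reindex_plan("my-search-service", "my-indexer"):
    print(method, url)
```

The order matters: a run issued without the preceding reset processes only new and changed documents.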
articles/search/search-howto-schedule-indexers.md
Lines changed: 3 additions & 3 deletions
@@ -109,11 +109,11 @@ Skills-based indexers run in a different [execution environment](search-howto-ru
109
109
110
110
Although multiple indexers can run simultaneously, a given indexer is single instance. You can't run two copies of the same indexer concurrently. If an indexer happens to still be running when its next scheduled execution is set to start, the pending execution is postponed until the next scheduled occurrence, allowing the current job to finish.
111
111
112
-
Let’s consider an example to make this more concrete. Suppose we configure an indexer schedule with an interval of hourly and a start time of June 1, 2021 at 8:00:00 AM UTC. Here's what could happen when an indexer run takes longer than an hour:
112
+
Let’s consider an example to make this more concrete. Suppose we configure an indexer schedule with an hourly interval and a start time of June 1, 2022 at 8:00:00 AM UTC. Here's what could happen when an indexer run takes longer than an hour:
113
113
114
-
+ The first indexer execution starts at or around June 1, 2021 at 8:00 AM UTC. Assume this execution takes 20 minutes (or any time less than 1 hour).
114
+
+ The first indexer execution starts at or around June 1, 2022 at 8:00 AM UTC. Assume this execution takes 20 minutes (or any amount of time that's less than 1 hour).
115
115
116
-
+ The second execution starts at or around June 1, 2021 9:00 AM UTC. Suppose that this execution takes 70 minutes - more than an hour – and it will not complete until 10:10 AM UTC.
116
+
+ The second execution starts at or around June 1, 2022 at 9:00 AM UTC. Suppose that this execution takes 70 minutes (more than an hour), so it doesn't complete until 10:10 AM UTC.
117
117
118
118
+ The third execution is scheduled to start at 10:00 AM UTC, but at that time the previous execution is still running. This scheduled execution is then skipped. The next execution of the indexer won't start until 11:00 AM UTC.
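The timeline above can be reproduced with a small simulation. This is a sketch of the skipping behavior as described in this section, not the service's scheduler. The function name `simulate_schedule` is hypothetical, and the treatment of a run that finishes exactly on a scheduled tick (the tick proceeds) is an assumption the text doesn't cover.

```python
from datetime import datetime, timedelta, timezone


def simulate_schedule(start, interval, durations):
    """Return the start time of each execution, skipping ticks that fall
    while the previous execution is still running."""
    starts = []
    tick = start
    for duration in durations:
        starts.append(tick)
        finished = tick + duration
        tick += interval
        # Any scheduled tick that arrives before the current run finishes is skipped.
        while tick < finished:
            tick += interval
    return starts


start = datetime(2022, 6, 1, 8, 0, tzinfo=timezone.utc)
starts = simulate_schedule(
    start,
    interval=timedelta(hours=1),
    durations=[timedelta(minutes=20), timedelta(minutes=70), timedelta(minutes=10)],
)
print([s.strftime("%H:%M") for s in starts])  # ['08:00', '09:00', '11:00']
```

The 10:00 AM tick never appears in the output because the 9:00 AM run is still in progress at that time.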