Commit 1237791

Merge pull request #177789 from HeidiSteen/heidist-gh
[azure search] indexer edits
2 parents 345a8ca + 8ba7b18 commit 1237791

3 files changed

+40
-42
lines changed

articles/search/knowledge-store-concept-intro.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -40,7 +40,7 @@ The following video explains both of these benefits and more.
4040

4141
A knowledge store is defined inside a skillset definition and it has two components:
4242

43-
+ a connection string to Azure Storage
43+
+ A connection string to Azure Storage
4444

4545
+ [**Projections**](knowledge-store-projection-overview.md) that determine whether the knowledge store consists of tables, objects or files.
4646

articles/search/search-howto-create-indexers.md

Lines changed: 24 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -8,20 +8,23 @@ author: HeidiSteen
88
ms.author: heidist
99
ms.service: cognitive-search
1010
ms.topic: conceptual
11-
ms.date: 01/28/2021
11+
ms.date: 11/02/2021
1212
---
1313

1414
# Creating indexers in Azure Cognitive Search
1515

16-
A search indexer provides an automated workflow for transferring documents and content from an external data source, to a search index on your search service. As originally designed, it extracts text and metadata from an Azure data source, serializes documents into JSON, and passes off the resulting documents to a search engine for indexing. It's since been extended to support [AI enrichment](cognitive-search-concept-intro.md) for deep content processing.
16+
A search indexer provides an automated workflow for reading content from an external data source, and ingesting that content into a search index on your search service. Indexers support two workflows:
17+
18+
+ Extracting text and metadata for full text search
19+
+ Analyzing images and large undifferentiated text for text and structure, adding [AI enrichment](cognitive-search-concept-intro.md) to the pipeline for deeper content processing.
1720

1821
Using indexers significantly reduces the quantity and complexity of the code you need to write. This article focuses on the mechanics of creating an indexer as preparation for more advanced work with source-specific indexers and [skillsets](cognitive-search-working-with-skillsets.md).
1922

20-
## What's an indexer definition?
23+
## Indexer structure
2124

22-
Indexers are used for either text-based indexing that pulls alphanumeric content from source fields into index fields, or AI-based processing that analyzes undifferentiated text for structure, or analyzes images for text and information, also adding that content to an index. The following index definitions are typical of what you might create for either scenario.
25+
The following index definitions are typical of what you might create for text-based and AI enrichment scenarios.
2326

24-
### Indexers for text content
27+
### Indexing for full text search
2528

2629
The original purpose of an indexer was to simplify the complex process of loading an index by providing a mechanism for connecting to and reading text and numeric content from fields in a data source, serialize that content as JSON documents, and hand off those documents to the search engine for indexing. This is still a primary use case, and for this operation, you'll need to create an indexer with the properties defined in the following example.
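
The article's own example falls outside this hunk. As a stand-in, here is a minimal sketch of an indexer definition built as a Python dictionary and serialized to JSON. The property names follow the Create Indexer REST payload; the helper name `build_indexer_definition`, the object names, and the field mapping values are hypothetical and only illustrate the shape.

```python
import json


def build_indexer_definition(name, data_source_name, target_index_name, disabled=False):
    """Return a minimal indexer definition as a plain dict (illustrative helper)."""
    return {
        "name": name,
        "dataSourceName": data_source_name,    # an existing data source object
        "targetIndexName": target_index_name,  # an existing search index
        "disabled": disabled,                  # True creates the indexer without running it
        "parameters": {
            # Run time behavior: how many failed items to tolerate before the run fails.
            "maxFailedItems": 0,
            "maxFailedItemsPerBatch": 0,
        },
        "fieldMappings": [
            # Hypothetical mapping for a source field whose name differs from the index field.
            {"sourceFieldName": "Id", "targetFieldName": "HotelId"},
        ],
    }


if __name__ == "__main__":
    definition = build_indexer_definition("hotels-indexer", "hotels-ds", "hotels-idx")
    print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of request body a Create Indexer call would carry; a schedule or skillset reference would be added as further top-level properties.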
2730

@@ -45,7 +48,7 @@ The **`parameters`** property modifies run time behaviors, such as how many erro
4548

4649
The **`field mappings`** property is used to explicitly map source-to-destination fields if those fields differ by name or type. Other properties (not shown), are used to [specify a schedule](search-howto-schedule-indexers.md), create the indexer in a disabled state, or specify an [encryption key](search-security-manage-encryption-keys.md) for supplemental encryption of data at rest.
4750

48-
### Indexers for AI indexing
51+
### Indexing for AI enrichment
4952

5053
Because indexers are the mechanism by which a search service makes outbound requests, indexers were extended to support AI enrichments, adding infrastructure and objects to implement this use case.
5154

@@ -73,27 +76,27 @@ All of the above properties and parameters apply to indexers that perform AI enr
7376

7477
AI enrichment is beyond the scope of this article. For more information, start with these articles: [AI enrichment](cognitive-search-concept-intro.md), [Skillsets in Azure Cognitive Search](cognitive-search-working-with-skillsets.md), and [Create Skillset (REST)](/rest/api/searchservice/create-skillset).
7578

76-
## Choose an indexer client and create the indexer
79+
## Prerequisites
7780

78-
When you are ready to create an indexer on a remote search service, you will need a search client in the form of a tool, like Azure portal or Postman, or code that instantiates an indexer client. We recommend the Azure portal or REST APIs for early development and proof-of-concept testing.
81+
+ Use a [supported data source](search-indexer-overview.md#supported-data-sources).
7982

80-
### Permissions
83+
+ Have admin rights. All operations related to indexers, including GET requests for status or definitions, require an [admin api-key](search-security-api-keys.md) on the request.
8184

82-
All operations related to indexers, including GET requests for status or definitions, require an [admin api-key](search-security-api-keys.md) on the request.
85+
All [service tiers limit](search-limits-quotas-capacity.md#indexer-limits) the number of objects that you can create. If you are experimenting on the Free tier, you can only have 3 objects of each type and 2 minutes of indexer processing (not including skillset processing).
8386

84-
### Limits
87+
## How to create indexers
8588

86-
All [service tiers limit](search-limits-quotas-capacity.md#indexer-limits) the number of objects that you can create. If you are experimenting on the Free tier, you can only have 3 objects of each type and 2 minutes of indexer processing (not including skillset processing).
89+
When you are ready to create an indexer on a remote search service, you will need a search client in the form of a tool, like Azure portal or Postman, or code that instantiates an indexer client. We recommend the Azure portal or REST APIs for early development and proof-of-concept testing.
8790

88-
### Use Azure portal to create an indexer
91+
### [**Azure portal**](#tab/indexer-portal)
8992

9093
The portal provides two options for creating an indexer: [**Import data wizard**](search-import-data-portal.md) and **New Indexer** that provides fields for specifying an indexer definition. The wizard is unique in that it creates all of the required elements. Other approaches require that you have predefined a data source and index.
9194

9295
The following screenshot shows where you can find these features in the portal.
9396

9497
:::image type="content" source="media/search-howto-create-indexers/portal-indexer-client.png" alt-text="hotels indexer" border="true":::
9598

96-
### Use a REST client
99+
### [**REST**](#tab/kstore-rest)
97100

98101
Both Postman and Visual Studio Code (with an extension for Azure Cognitive Search) can function as an indexer client. Using either tool, you can connect to your search service and send [Create Indexer (REST)](/rest/api/searchservice/create-indexer) requests. There are numerous tutorials and examples that demonstrate REST clients for creating objects.
99102

@@ -104,7 +107,7 @@ Start with either of these articles to learn about each client:
104107

105108
Refer to the [Indexer operations (REST)](/rest/api/searchservice/Indexer-operations) for help with formulating indexer requests.
106109

107-
### Use an SDK
110+
### [**.NET SDK**](#tab/kstore-dotnet)
108111

109112
For Cognitive Search, the Azure SDKs implement generally available features. As such, you can use any of the SDKs to create indexer-related objects. All of them provide a **SearchIndexerClient** that has methods for creating indexers and related objects, including skillsets.
110113

@@ -115,9 +118,11 @@ For Cognitive Search, the Azure SDKs implement generally available features. As
115118
| JavaScript | [SearchIndexerClient](/javascript/api/@azure/search-documents/searchindexerclient) | [Indexers](https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/search/search-documents/samples/v11/javascript) |
116119
| Python | [SearchIndexerClient](/python/api/azure-search-documents/azure.search.documents.indexes.searchindexerclient) | [sample_indexers_operations.py](https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/search/azure-search-documents/samples/sample_indexers_operations.py) |
117120

121+
---
122+
118123
## Run the indexer
119124

120-
An indexer runs automatically when you create the indexer on the service. This is the moment of truth where you will find out if there are data source connection errors, field mapping issues, or skillset problems.
125+
Unless you set the **`disabled=true`** in the indexer definition, an indexer runs immediately when you create the indexer on the service. This is the moment of truth where you will find out if there are data source connection errors, field mapping issues, or skillset problems.
121126

122127
There are several ways to run an indexer:
123128

@@ -127,9 +132,6 @@ There are several ways to run an indexer:
127132

128133
+ Run a program that calls SearchIndexerClient methods for create, update, or run.
129134

130-
> [!NOTE]
131-
> To avoid immediately running an indexer upon creation, include **`disabled=true`** in the indexer definition.
132-
133135
Alternatively, put the indexer [on a schedule](search-howto-schedule-indexers.md) to invoke processing at regular intervals.
134136

135137
Scheduled processing usually coincides with a need for incremental indexing of changed content. Change detection logic is a capability that's built into source platforms. Changes in a blob container are detected by the indexer automatically. For guidance on leveraging change detection in other data sources, refer to the indexer docs for specific data sources:
@@ -153,17 +155,17 @@ For large indexing loads, an indexer also keeps track of the last document it pr
153155

154156
If you need to clear the high water mark to re-index in full, you can use [Reset Indexer](/rest/api/searchservice/reset-indexer). For more selective re-indexing, use [Reset Skills](/rest/api/searchservice/preview-api/reset-skills) or [Reset Documents](/rest/api/searchservice/preview-api/reset-documents). Through the reset APIs, you can clear internal state, and also flush the cache if you enabled [incremental enrichment](search-howto-incremental-index.md). For more background and comparison of each reset option, see [Run or reset indexers, skills, and documents](search-howto-run-reset-indexers.md).
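
For illustration, the run and reset operations can be sketched as REST requests built with the Python standard library. This is a sketch under assumptions: the helper name `build_indexer_request`, the service and indexer names, and the `2020-06-30` API version are placeholders, and the request is only constructed, not sent. The `api-key` header must carry an admin key, since indexer operations require admin rights.

```python
from urllib.request import Request

API_VERSION = "2020-06-30"  # assumption: substitute the API version you target


def build_indexer_request(service_name, indexer_name, action, admin_key):
    """Build (but do not send) a POST request that runs or resets an indexer."""
    if action not in ("run", "reset"):
        raise ValueError("action must be 'run' or 'reset'")
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexers/{indexer_name}/{action}?api-version={API_VERSION}"
    )
    # Indexer operations require an admin api-key on the request.
    return Request(url, method="POST", headers={"api-key": admin_key})


if __name__ == "__main__":
    request = build_indexer_request("my-service", "hotels-indexer", "reset", "<admin-key>")
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it. A reset clears the high water mark, so a subsequent run re-indexes in full.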
155157

156-
## Know your data
158+
## Data preparation
157159

158-
Indexers expect a tabular row set, where each row becomes a full or partial search document in the index. Often, there is a one-to-one correspondence between a row and the resulting search document, where all the fields in the row set fully populate each document. But you can use indexers to generate just part of a document, for example if you're using multiple indexers or approaches to build out the index.
160+
Indexers expect a tabular row set, where each row becomes a full or partial search document in the index. Often, there is a one-to-one correspondence between a row in a database and the resulting search document, where all the fields in the row set fully populate each document. But you can use indexers to generate a subset of a document's fields, and fill in the remaining fields using a different indexer or methodology.
159161

160162
To flatten relational data into a row set, you should create a SQL view, or build a query that returns parent and child records in the same row. For example, the built-in hotels sample dataset is a SQL database that has 50 records (one for each hotel), linked to room records in a related table. The query that flattens the collective data into a row set embeds all of the room information in JSON documents in each hotel record. The embedded room information is generated by a query that uses a **FOR JSON AUTO** clause. You can learn more about this technique in [define a query that returns embedded JSON](index-sql-relational-data.md#define-a-query-that-returns-embedded-json). This is just one example; you can find other approaches that will produce the same effect.
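
To make the target shape concrete, the following sketch flattens parent and child records in Python rather than SQL. It is not the article's **FOR JSON AUTO** query; it only shows the result such a query aims for: one row per hotel with the related rooms embedded. The function name and all field names are hypothetical.

```python
from collections import defaultdict


def flatten_hotels(hotels, rooms):
    """Return one document per hotel, with its room records embedded under 'Rooms'."""
    rooms_by_hotel = defaultdict(list)
    for room in rooms:
        # Drop the foreign key; the parent document already carries HotelId.
        room_fields = {key: value for key, value in room.items() if key != "HotelId"}
        rooms_by_hotel[room["HotelId"]].append(room_fields)
    return [
        {**hotel, "Rooms": rooms_by_hotel.get(hotel["HotelId"], [])}
        for hotel in hotels
    ]


if __name__ == "__main__":
    hotels = [{"HotelId": "1", "HotelName": "Sample Inn"}]
    rooms = [
        {"HotelId": "1", "Type": "Suite", "BaseRate": 250.0},
        {"HotelId": "1", "Type": "Standard", "BaseRate": 120.0},
    ]
    for document in flatten_hotels(hotels, rooms):
        print(document)
```

Each returned dictionary corresponds to one row in the flattened row set, which an indexer can then map to a single search document.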
161163

162164
In addition to flattened data, it's important to pull in only searchable data. Searchable data is alphanumeric. Cognitive Search cannot search over binary data in any format, although it can extract and infer text descriptions of image files (see [AI enrichment](cognitive-search-concept-intro.md)) to create searchable content. Likewise, using AI enrichment, large text can be analyzed by natural language models to find structure or relevant information, generating new content that you can add to a search document.
163165

164166
Given that indexers don't fix data problems, other forms of data cleansing or manipulation might be needed. For more information, you should refer to the product documentation of your [Azure database product](../index.yml?product=databases).
165167

166-
## Know your index
168+
## Index preparation
167169

168170
Recall that indexers pass off the search documents to the search engine for indexing. Just as indexers have properties that determine execution behavior, an index schema has properties that profoundly affect how strings are indexed (only strings are analyzed and tokenized). Depending on analyzer assignments, indexed strings might be different from what you passed in. You can evaluate the effects of analyzers using [Analyze Text (REST)](/rest/api/searchservice/test-analyzer). For more information about analyzers, see [Analyzers for text processing](search-analyzers.md).
169171
