Commit ae6f332

Merge pull request #188685 from HeidiSteen/heidist-fresh2
Updates to create indexer doc
2 parents: 4b5104d + e7839f1, commit ae6f332

File tree: 1 file changed (+53 -24 lines)


articles/search/search-howto-create-indexers.md

Lines changed: 53 additions & 24 deletions
@@ -16,19 +16,19 @@ ms.date: 01/17/2022

A search indexer connects to an external data source, retrieves and processes data, and then passes it to the search engine for indexing. Indexers support two workflows:

+ Text-based indexing, extracting strings and metadata for full text search scenarios.

+ AI-enriched indexing, applying integrated machine learning and AI models to analyze content that isn't otherwise searchable, such as images and large undifferentiated text.

Using indexers significantly reduces the quantity and complexity of the code you need to write. This article focuses on the basics of creating an indexer. Depending on the data source and your workflow, more configuration might be necessary.

## Indexer definitions

When you create an indexer, the definition will adhere to one of two patterns: text-based indexing or AI enrichment with skills.

### Indexer definition for full text search

Full text search is the primary use case for indexers, and for this workflow, an indexer composition will be similar to the following example.

```json
{
@@ -50,15 +50,23 @@ Full text search is the primary use case for indexers, and for this operation, a
}
```

Indexers have the following requirements:

+ A "name" property that uniquely identifies the indexer in the indexer collection.
+ A "dataSourceName" property that points to a data source object. It specifies a connection to external data.
+ A "targetIndexName" property that points to the destination search index.

Parameters are optional and modify run time behaviors, such as how many errors to accept before failing the entire job. The parameters above are available for all indexers and are documented in the [REST API reference](/rest/api/searchservice/create-indexer#request-body).

Source-specific indexers for blobs, SQL, and Cosmos DB provide extra "configuration" parameters for source-specific behaviors. For example, if the source is Blob Storage, you can set a parameter that filters on file extensions: `"parameters" : { "configuration" : { "indexedFileNameExtensions" : ".pdf,.docx" } }`.

[Field mappings](search-indexer-field-mappings.md) are used to explicitly map source-to-destination fields if those fields differ by name or type.
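
Pulled together, a minimal indexer definition for a blob data source, with one configuration parameter and one explicit field mapping, might look like the following sketch. This is an illustration only: the property names come from the [Create Indexer (REST)](/rest/api/searchservice/create-indexer) request body, while the indexer, data source, index, and field names are hypothetical.

```json
{
  "name": "my-blob-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-search-index",
  "parameters": {
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "configuration": { "indexedFileNameExtensions": ".pdf,.docx" }
  },
  "fieldMappings": [
    { "sourceFieldName": "metadata_storage_name", "targetFieldName": "fileName" }
  ]
}
```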

An indexer will run immediately when you create it on the search service. If you don't want indexer execution, set "disabled" to true.

You can also [specify a schedule](search-howto-schedule-indexers.md) or set an [encryption key](search-security-manage-encryption-keys.md) for supplemental encryption of the indexer definition.
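
For example, a schedule that runs the indexer every two hours can be expressed directly in the definition, as in the following sketch. The interval is an XSD dayTimeDuration value; the start time and object names are placeholders.

```json
{
  "name": "my-blob-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-search-index",
  "schedule": {
    "interval": "PT2H",
    "startTime": "2022-01-20T00:00:00Z"
  }
}
```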

### Indexer definition for AI enrichment

Indexers also drive [AI enrichment](cognitive-search-concept-intro.md). All of the above properties and parameters apply, but the following properties are specific to AI enrichment: **`skillSetName`**, **`outputFieldMappings`**, **`cache`**. A few other required and similarly named properties are added for context.

@@ -82,41 +90,62 @@ AI enrichment is out of scope for this article. For more information, start with

## Prerequisites

+ Identify a [supported data source](search-indexer-overview.md#supported-data-sources) that contains the content you want to ingest.

+ [Create a search index](search-how-to-create-search-index.md) that can accept incoming data.

+ Be under the [maximum limits](search-limits-quotas-capacity.md#indexer-limits) for your service tier. The Free tier allows three objects of each type and 1-3 minutes of indexer processing, or 3-10 minutes if there's a skillset.

## Prepare external data

Indexers work with data sets. When you run an indexer, it connects to your data source, retrieves the data from the container or folder, and optionally serializes it into JSON before passing it to the search engine for indexing. This section describes the requirements of incoming data for text-based indexing.

| Source data | Tasks |
|-------------|-------|
| JSON documents | Make sure the structure or shape of incoming data corresponds to the schema of your search index. Most search indexes are fairly flat, where the fields collection consists of fields at the same level. However, hierarchical or nested structures are possible through [complex fields and collections](search-howto-complex-data-types.md). |
| Relational | Provide it as a flattened row set, where each row becomes a full or partial search document in the index. </p>To flatten relational data into a row set, create a SQL view, or build a query that returns parent and child records in the same row. For example, the built-in hotels sample dataset is an SQL database that has 50 records (one for each hotel), linked to room records in a related table. The query that flattens the collective data into a row set embeds all of the room information in JSON documents in each hotel record. The embedded room information is generated by a query that uses a **FOR JSON AUTO** clause, and the general shape of the result is sketched after this table. </p> You can learn more about this technique in [define a query that returns embedded JSON](index-sql-relational-data.md#define-a-query-that-returns-embedded-json). This is just one example; you can find other approaches that will produce the same result. |
| Files | An indexer generally creates one search document for each file, where the search document consists of fields for content and metadata. Depending on the file type, the indexer can sometimes [parse one file into multiple search documents](search-howto-index-one-to-many-blobs.md). For example, in a CSV file, each row can become a standalone search document. |
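
As a hypothetical illustration of that flattened shape, a single row from the hotels sample might come across as a document like the following sketch, with the embedded room data produced by the **FOR JSON AUTO** query. The field names are abbreviated from the sample and aren't exhaustive.

```json
{
  "HotelId": "1",
  "HotelName": "Stay-Kay City Hotel",
  "Description": "Ideally located on the main commercial artery of the city.",
  "Rooms": [
    { "Description": "Budget Room, 1 Queen Bed", "BaseRate": 96.99 },
    { "Description": "Deluxe Room, 2 Double Beds", "BaseRate": 150.99 }
  ]
}
```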

Remember that you'll only need to pull in searchable and filterable data:

+ Searchable data is text.
+ Filterable data is alphanumeric.

Cognitive Search can't search over binary data in any format, although it can extract and infer text descriptions of image files (see [AI enrichment](cognitive-search-concept-intro.md)) to create searchable content. Likewise, large text can be broken down and analyzed by natural language models to find structure or relevant information, generating new content that you can add to a search document.

Given that indexers don't fix data problems, other forms of data cleansing or manipulation might be needed. For more information, you should refer to the product documentation of your [Azure database product](../index.yml?product=databases).

## Prepare a data source

Indexers require a data source that specifies the type, location, and connection information.

1. Make sure you're using a [supported data source type](search-indexer-overview.md#supported-data-sources).

1. [Create a data source](/rest/api/searchservice/create-data-source). The following list includes a few of the more frequently used data sources; a sketch of a data source definition follows this list.

   + [Azure Blob Storage](search-howto-indexing-azure-blob-storage.md)
   + [Azure Cosmos DB](search-howto-index-cosmosdb.md)
   + [Azure SQL Database](search-howto-connecting-azure-sql-database-to-azure-search-using-indexers.md)
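
For example, posting a body like the following sketch to the [Create Data Source (REST)](/rest/api/searchservice/create-data-source) endpoint (`https://[service name].search.windows.net/datasources?api-version=2020-06-30`) defines a blob data source. The data source name, connection string, and container are placeholders.

```json
{
  "name": "my-blob-datasource",
  "type": "azureblob",
  "credentials": { "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;" },
  "container": { "name": "my-container" }
}
```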

## Prepare an index

Indexers also require a search index. Recall that indexers pass data off to the search engine for indexing. Just as indexers have properties that determine execution behavior, an index schema has properties that profoundly affect how strings are indexed (only strings are analyzed and tokenized).

1. Start with [Create a search index](search-how-to-create-search-index.md).

1. Set up the fields collection and field attributes.

   Fields are the only receptors of external content. Depending on how the fields are attributed in the schema, the values for each field will be analyzed, tokenized, or stored as verbatim strings for filters, fuzzy search, and typeahead queries.

   Indexers can automatically map source fields to target index fields when the names and types are equivalent. If a field can't be implicitly mapped, remember that you can [define an explicit field mapping](search-indexer-field-mappings.md) that tells the indexer how to route the content.
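
   For instance, a small fields collection for the blob scenario above might look like the following sketch. The field names are hypothetical; the attributes shown are standard index field attributes from the Create Index REST API.

   ```json
   {
     "name": "my-search-index",
     "fields": [
       { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
       { "name": "fileName", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true },
       { "name": "content", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" }
     ]
   }
   ```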

1. Review the analyzer assignments on each field. Analyzers can transform strings. As such, indexed strings might be different from what you passed in. You can evaluate the effects of analyzers using [Analyze Text (REST)](/rest/api/searchservice/test-analyzer). For more information about analyzers, see [Analyzers for text processing](search-analyzers.md).
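
   As a quick check, you can post a body like the following sketch to the Analyze Text endpoint (`https://[service name].search.windows.net/indexes/[index name]/analyze?api-version=2020-06-30`) to see how a given analyzer tokenizes a sample string. The text and analyzer shown here are only examples.

   ```json
   {
     "text": "Ideally located on the main commercial artery of the city.",
     "analyzer": "standard.lucene"
   }
   ```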

During indexing, an indexer only checks field names and types. There's no validation step that ensures incoming content is correct for the corresponding search field in the index.

## Create an indexer

When you're ready to create an indexer on a remote search service, you'll need a search client. A search client can be the Azure portal, Postman or another REST client, or code that instantiates an indexer client. We recommend the Azure portal or REST APIs for early development and proof-of-concept testing.

### [**Azure portal**](#tab/portal)

@@ -173,7 +202,7 @@ For Cognitive Search, the Azure SDKs implement generally available features. As

## Run the indexer

By default, an indexer runs immediately when you create it on the search service. You can override this behavior by setting "disabled" to true in the indexer definition. Indexer execution is the moment of truth where you'll find out if there are problems with connections, field mappings, or skillset construction.
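
For example, a definition created for later execution could carry the "disabled" flag, as in the following sketch (object names are placeholders); you can then run the indexer on demand when you're ready.

```json
{
  "name": "my-blob-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-search-index",
  "disabled": true
}
```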

There are several ways to run an indexer:

@@ -203,7 +232,7 @@ If you need to clear the high water mark to re-index in full, you can use [Reset

[Monitor indexer status](search-howto-monitor-indexers.md) to track execution. Successful execution can still include warnings and notifications. Be sure to check both successful and failed status notifications for details about the job.

For content verification, [run queries](search-query-create.md) on the populated index that return entire documents or selected fields.
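
For example, a quick verification query posted to the Search Documents endpoint (`https://[service name].search.windows.net/indexes/[index name]/docs/search?api-version=2020-06-30`) might match all documents, return a count, and select a couple of fields. The field names below are placeholders.

```json
{
  "search": "*",
  "select": "id, fileName",
  "count": true
}
```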

## Next steps