Commit e29f50b

committed
checkpoint
1 parent 436c341 commit e29f50b

4 files changed

+115
-170
lines changed

articles/search/search-file-storage-integration.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -157,11 +157,15 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
157157

158158
If both `indexedFileNameExtensions` and `excludedFileNameExtensions` parameters are present, Azure Cognitive Search first looks at `indexedFileNameExtensions`, then at `excludedFileNameExtensions`. If the same file extension is present in both lists, it will be excluded from indexing.
159159

160+
1. [Specify field mappings](search-indexer-field-mappings.md) if there are differences in field name or type, or if you need multiple versions of a source field in the search index.
161+
162+
In file indexing, you can often omit field mappings because the indexer has built-in support for mapping the "content" and metadata properties to similarly named and typed fields in an index. For metadata properties, the indexer will automatically replace hyphens `-` with underscores in the search index.
163+
160164
1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.
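
The precedence between `indexedFileNameExtensions` and `excludedFileNameExtensions` described in the steps above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (inclusion is checked first, then exclusion, and an extension in both lists is excluded); it is not the service's implementation. The helper names `parse_extensions` and `is_indexed` are hypothetical, and the case-insensitive comparison is an assumption.

```python
def parse_extensions(value):
    """Split a comma-separated list such as ".pdf, .docx" into a set."""
    # Assumption: extensions are compared case-insensitively.
    return {ext.strip().lower() for ext in value.split(",") if ext.strip()}


def is_indexed(file_name, indexed=None, excluded=None):
    """Return True if a file would pass both extension filters.

    Illustrative only: the inclusion list is consulted first, then the
    exclusion list, so an extension present in both lists is excluded.
    """
    dot = file_name.rfind(".")
    extension = file_name[dot:].lower() if dot != -1 else ""
    if indexed is not None and extension not in parse_extensions(indexed):
        return False
    if excluded is not None and extension in parse_extensions(excluded):
        return False
    return True


print(is_indexed("resume.pdf", ".pdf, .docx", ".png, .jpeg"))
print(is_indexed("report.pdf", ".pdf", ".pdf"))
```

With these inputs, the first call passes both filters, while the second is rejected because `.pdf` appears in both lists.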
161165

162166
## Next steps
163167

164-
You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers). The following articles apply to indexers that pull content from Azure Storage:
168+
You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers.md). The following articles apply to indexers that pull content from Azure Storage:
165169

166170
+ [Change detection and deletion detection](search-howto-index-changed-deleted-blobs.md)
167171
+ [Index large data sets](search-howto-large-index.md)

articles/search/search-howto-index-azure-data-lake-storage.md

Lines changed: 51 additions & 68 deletions
Original file line numberDiff line numberDiff line change
@@ -160,6 +160,12 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
160160

161161
1. In the optional "configuration" section, provide any inclusion or exclusion criteria. If left unspecified, all blobs in the container are retrieved.
162162

163+
1. [Specify field mappings](search-indexer-field-mappings.md) if there are differences in field name or type, or if you need multiple versions of a source field in the search index.
164+
165+
In blob indexing, you can often omit field mappings because the indexer has built-in support for mapping the "content" and metadata properties to similarly named and typed fields in an index. For metadata properties, the indexer will automatically replace hyphens `-` with underscores in the search index.
166+
167+
1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.
168+
163169
### How to make an encoded field "searchable"
164170

165171
There are times when you need to use an encoded version of a field like `metadata_storage_path` as the key, but also need that field to be searchable (without encoding) in the search index. To support both use cases, you can map `metadata_storage_path` to two fields: one for the key (encoded), and a second for a path field that we can assume is attributed as "searchable" in the index schema. The example below shows two field mappings for `metadata_storage_path`.
@@ -205,45 +211,50 @@ api-key: [admin key]
205211
}
206212
```
207213

208-
## Indexing blob content
214+
## How blobs are indexed
209215

210-
By default, blobs with structured content, such as JSON or CSV, are indexed as a single chunk of text. But if the JSON or CSV documents have an internal structure (delimiters), you can assign parsing modes to generate individual search documents for each line or element. For more information, see [Indexing JSON blobs](search-howto-index-json-blobs.md) and [Indexing CSV blobs](search-howto-index-csv-blobs.md).
216+
By default, most blobs are indexed as a single search document in the index, including blobs with structured content, such as JSON or CSV, which are indexed as a single chunk of text. However, for JSON or CSV documents that have an internal structure (delimiters), you can assign parsing modes to generate individual search documents for each line or element:
211217

212-
A compound or embedded document (such as a ZIP archive, a Word document with embedded Outlook email containing attachments, or a .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of an .MSG file will be returned in the normalized_images field.
218+
+ [Indexing JSON blobs](search-howto-index-json-blobs.md)
219+
+ [Indexing CSV blobs](search-howto-index-csv-blobs.md)
213220

214-
The textual content of the document is extracted into a string field named `content`.
221+
A compound or embedded document (such as a ZIP archive, a Word document with embedded Outlook email containing attachments, or a .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of an .MSG file will be returned in the normalized_images field. If you have images, consider adding [AI enrichment](cognitive-search-concept-intro.md) to get more search utility from that content.
222+
223+
Textual content of a document is extracted into a string field named "content".
215224

216225
> [!NOTE]
217-
> Azure Cognitive Search limits how much text it extracts depending on the pricing tier. The current [service limits](search-limits-quotas-capacity.md#indexer-limits) are 32,000 characters for Free tier, 64,000 for Basic, 4 million for Standard, 8 million for Standard S2, and 16 million for Standard S3. A warning is included in the indexer status response for truncated documents.
226+
> Azure Cognitive Search imposes [indexer limits](search-limits-quotas-capacity.md#indexer-limits) on how much text it extracts depending on the pricing tier. A warning will appear in the indexer status response if documents are truncated.
218227
219228
<a name="indexing-blob-metadata"></a>
220229

221230
### Indexing blob metadata
222231

223-
Indexers can also index blob metadata. First, any user-specified metadata properties can be extracted verbatim. To receive the values, you must define field in the search index of type `Edm.String`, with same name as the metadata key of the blob. For example, if a blob has a metadata key of `Sensitivity` with value `High`, you should define a field named `Sensitivity` in your search index and it will be populated with the value `High`.
232+
Blob metadata can also be indexed, and that's helpful if you think any of the standard or custom metadata properties will be useful in filters and queries.
233+
234+
User-specified metadata properties are extracted verbatim. To receive the values, you must define a field in the search index of type `Edm.String`, with the same name as the metadata key of the blob. For example, if a blob has a metadata key of `Sensitivity` with value `High`, you should define a field named `Sensitivity` in your search index and it will be populated with the value `High`.
224235

225-
Second, standard blob metadata properties can be extracted into the fields listed below. The blob indexer automatically creates internal field mappings for these blob metadata properties. You still have to add the fields you want to use the index definition, but you can omit creating field mappings in the indexer.
236+
Standard blob metadata properties can be extracted into similarly named and typed fields, as listed below. The blob indexer automatically creates internal field mappings for these blob metadata properties, converting the original hyphenated name ("metadata-storage-name") to an underscored equivalent name ("metadata_storage_name").
226237

227-
+ **metadata_storage_name** (`Edm.String`) - the file name of the blob. For example, if you have a blob /my-container/my-folder/subfolder/resume.pdf, the value of this field is `resume.pdf`.
238+
You still have to add the underscored fields to the index definition, but you can omit creating field mappings in the indexer because the indexer will recognize the counterpart automatically.
228239

229-
+ **metadata_storage_path** (`Edm.String`) - the full URI of the blob, including the storage account. For example, `https://myaccount.blob.core.windows.net/my-container/my-folder/subfolder/resume.pdf`
240+
+ **metadata_storage_name** (`Edm.String`) - the file name of the blob. For example, if you have a blob /my-container/my-folder/subfolder/resume.pdf, the value of this field is `resume.pdf`.
230241

231-
+ **metadata_storage_content_type** (`Edm.String`) - content type as specified by the code you used to upload the blob. For example, `application/octet-stream`.
242+
+ **metadata_storage_path** (`Edm.String`) - the full URI of the blob, including the storage account. For example, `https://myaccount.blob.core.windows.net/my-container/my-folder/subfolder/resume.pdf`
232243

233-
+ **metadata_storage_last_modified** (`Edm.DateTimeOffset`) - last modified timestamp for the blob. Azure Cognitive Search uses this timestamp to identify changed blobs, to avoid reindexing everything after the initial indexing.
244+
+ **metadata_storage_content_type** (`Edm.String`) - content type as specified by the code you used to upload the blob. For example, `application/octet-stream`.
234245

235-
+ **metadata_storage_size** (`Edm.Int64`) - blob size in bytes.
246+
+ **metadata_storage_last_modified** (`Edm.DateTimeOffset`) - last modified timestamp for the blob. Azure Cognitive Search uses this timestamp to identify changed blobs, to avoid reindexing everything after the initial indexing.
236247

237-
+ **metadata_storage_content_md5** (`Edm.String`) - MD5 hash of the blob content, if available.
248+
+ **metadata_storage_size** (`Edm.Int64`) - blob size in bytes.
238249

239-
+ **metadata_storage_sas_token** (`Edm.String`) - A temporary SAS token that can be used by [custom skills](cognitive-search-custom-skill-interface.md) to get access to the blob. This token should not be stored for later use as it might expire.
250+
+ **metadata_storage_content_md5** (`Edm.String`) - MD5 hash of the blob content, if available.
251+
252+
+ **metadata_storage_sas_token** (`Edm.String`) - A temporary SAS token that can be used by [custom skills](cognitive-search-custom-skill-interface.md) to get access to the blob. This token should not be stored for later use as it might expire.
240253

241254
Lastly, any metadata properties specific to the document format of the blobs you are indexing can also be represented in the index schema. For more information about content-specific metadata, see [Content metadata properties](search-blob-metadata-properties.md).
242255

243256
It's important to point out that you don't need to define fields for all of the above properties in your search index - just capture the properties you need for your application.
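
As a rough sketch of the naming rule above, the following Python derives index field definitions for a chosen subset of standard metadata properties. The field types come from the list above. The hyphenated source names other than "metadata-storage-name" are inferred from the underscored names, and the helper names are hypothetical. The output uses the `name` and `type` keys of an index field definition; other field attributes are left out.

```python
# Types are taken from the list of standard metadata fields above.
# Assumption: every hyphenated source name mirrors its underscored field name.
STANDARD_METADATA_TYPES = {
    "metadata-storage-name": "Edm.String",
    "metadata-storage-path": "Edm.String",
    "metadata-storage-content-type": "Edm.String",
    "metadata-storage-last-modified": "Edm.DateTimeOffset",
    "metadata-storage-size": "Edm.Int64",
    "metadata-storage-content-md5": "Edm.String",
}


def to_index_field_name(property_name):
    """Replace hyphens with underscores, mirroring the indexer's naming rule."""
    return property_name.replace("-", "_")


def build_metadata_fields(property_names):
    """Build minimal field definitions for only the properties you need."""
    return [
        {"name": to_index_field_name(p), "type": STANDARD_METADATA_TYPES[p]}
        for p in property_names
    ]


fields = build_metadata_fields(["metadata-storage-name", "metadata-storage-size"])
for field in fields:
    print(field)
```

Only the requested properties produce fields, which matches the guidance to define just the properties your application needs.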
244257

245-
<a name="WhichBlobsAreIndexed"></a>
246-
247258
## How to control which blobs are indexed
248259

249260
You can control which blobs are indexed, and which are skipped, by the blob's file type or by setting properties on the blobs themselves, causing the indexer to skip over them.
@@ -255,7 +266,6 @@ PUT /indexers/[indexer name]?api-version=2020-06-30
255266
{
256267
"parameters" : {
257268
"configuration" : {
258-
"indexedFileNameExtensions" : ".pdf, .docx"
259269
"indexedFileNameExtensions" : ".pdf, .docx",
260270
"excludedFileNameExtensions" : ".png, .jpeg"
261271
}
@@ -274,78 +284,51 @@ The indexer configuration parameters apply to all blobs in the container or fold
274284

275285
## Index large datasets
276286

277-
Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to [process the data in parallel](search-howto-large-index.md#parallel-indexing). Here's how you can set this up:
287+
Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to [process the data in parallel](search-howto-large-index.md#parallel-indexing).
278288

279289
1. Partition your data into multiple blob containers or virtual folders.
280290

281-
1. Set up several data sources, one per container or folder. To point to a blob folder, use the `query` parameter:
282-
283-
```json
284-
{
285-
"name" : "blob-datasource",
286-
"type" : "azureblob",
287-
"credentials" : { "connectionString" : "<your storage connection string>" },
288-
"container" : { "name" : "my-container", "query" : "my-folder" }
289-
}
290-
```
291-
292-
1. Create a corresponding indexer for each data source. All of the indexers should point to the same target search index.
291+
1. Set up several data sources, one per container or folder. Use the `query` parameter to specify the partition: `"container" : { "name" : "my-container", "query" : "my-folder" }`.
293292

294-
One search unit in your service can run one indexer at any given time. Creating multiple indexers as described above is only useful if they actually run in parallel.
293+
1. Create one indexer for each data source. Point them to the same target index.
295294

296-
To run multiple indexers in parallel, scale out your search service by creating an appropriate number of partitions and replicas. For example, if your search service has 6 search units (for example, 2 partitions x 3 replicas), then 6 indexers can run simultaneously, resulting in a six-fold increase in the indexing throughput. To learn more about scaling and capacity planning, see [Adjust the capacity of an Azure Cognitive Search service](search-capacity-planning.md).
295+
Make sure you have sufficient capacity. One search unit in your service can run one indexer at any given time. Partitioning data and creating multiple indexers is only useful if they can run in parallel.
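
The partitioning steps above could be scripted along these lines. This is a minimal sketch that only builds the request bodies and sends nothing to the service. The function name, the naming scheme for data sources and indexers, and the folder names are hypothetical. The data source shape follows the earlier version of this article's example, including its `azureblob` type, which you may need to change for your storage type. The indexer is assumed to reference its data source and index through `dataSourceName` and `targetIndexName`.

```python
def build_partitioned_definitions(container, folders, target_index,
                                  connection_string,
                                  data_source_type="azureblob"):
    """Return (data_sources, indexers): one pair per virtual folder.

    Illustrative only. All indexers point at the same target index.
    """
    data_sources = []
    indexers = []
    for folder in folders:
        # Hypothetical naming scheme derived from the folder path.
        suffix = folder.strip("/").replace("/", "-")
        data_source_name = f"{container}-{suffix}-ds"
        data_sources.append({
            "name": data_source_name,
            "type": data_source_type,
            "credentials": {"connectionString": connection_string},
            # The "query" value selects the folder that forms this partition.
            "container": {"name": container, "query": folder},
        })
        indexers.append({
            "name": f"{container}-{suffix}-indexer",
            "dataSourceName": data_source_name,
            "targetIndexName": target_index,
        })
    return data_sources, indexers


data_sources, indexers = build_partitioned_definitions(
    "my-container",
    ["folder-a", "folder-b"],
    "my-index",
    "<your storage connection string>",
)
print(len(data_sources), len(indexers))
```

Whether the resulting indexers actually run in parallel still depends on the capacity of the search service, as noted above.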
297296

298297
<a name="DealingWithErrors"></a>
299298

300-
## Handling errors
299+
## Configure the response to errors
301300

302301
Errors that commonly occur during indexing include unsupported content types, missing content, or oversized blobs.
303302

304-
By default, the blob indexer stops as soon as it encounters a blob with an unsupported content type (for example, an image). You could use the `excludedFileNameExtensions` parameter to skip certain content types. However, you might want to indexing to proceed even if errors occur, and then debug individual documents later. For more information about indexer errors, see [Indexer troubleshooting guidance](search-indexer-troubleshooting.md) and [Indexer errors and warnings](cognitive-search-common-errors-warnings.md).
305-
306-
### Respond to errors
307-
308-
There are four indexer properties that control the indexer's response when errors occur. The following examples show how to set these properties in the indexer definition. If an indexer already exists, you can add these properties by editing the definition in the portal.
309-
310-
#### `"maxFailedItems"` and `"maxFailedItemsPerBatch"`
311-
312-
Continue indexing if errors happen at any point of processing, either while parsing blobs or while adding documents to an index. Set these properties to the number of acceptable failures. A value of `-1` allows processing no matter how many errors occur. Otherwise, the value is a positive integer.
313-
314-
```http
315-
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2020-06-30
316-
Content-Type: application/json
317-
api-key: [admin key]
318-
319-
{
320-
... other parts of indexer definition
321-
"parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 10 }
322-
}
323-
```
303+
By default, the blob indexer stops as soon as it encounters a blob with an unsupported content type (for example, an audio file). You could use the "excludedFileNameExtensions" parameter to skip certain content types. However, you might want indexing to proceed even if errors occur, and then debug individual documents later. For more information about indexer errors, see [Indexer troubleshooting guidance](search-indexer-troubleshooting.md) and [Indexer errors and warnings](cognitive-search-common-errors-warnings.md).
324304

325-
#### `"failOnUnsupportedContentType"` and `"failOnUnprocessableDocument"`
326-
327-
For some blobs, Azure Cognitive Search is unable to determine the content type, or unable to process a document of an otherwise supported content type. To ignore these failure conditions, set configuration parameters to `false`:
305+
There are five indexer properties that control the indexer's response when errors occur.
328306

329307
```http
330-
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2020-06-30
331-
Content-Type: application/json
332-
api-key: [admin key]
333-
308+
PUT /indexers/[indexer name]?api-version=2020-06-30
334309
{
335-
... other parts of indexer definition
336-
"parameters" : { "configuration" : { "failOnUnsupportedContentType" : false, "failOnUnprocessableDocument" : false } }
310+
"parameters" : {
311+
"maxFailedItems" : 10,
312+
"maxFailedItemsPerBatch" : 10,
313+
"configuration" : {
314+
"failOnUnsupportedContentType" : false,
315+
"failOnUnprocessableDocument" : false,
316+
"indexStorageMetadataOnlyForOversizedDocuments": false
317+
}
}
337318
}
338319
```
339320

340-
### Relax indexer constraints
341-
342-
You can also set [blob configuration properties](/rest/api/searchservice/create-indexer#blob-configuration-parameters) that effectively determine whether an error condition exists. The following property can relax constraints, suppressing errors that would otherwise occur.
343-
344-
+ `"indexStorageMetadataOnlyForOversizedDocuments"` to index storage metadata for blob content that is too large to process. Oversized blobs are treated as errors by default. For limits on blob size, see [service Limits](search-limits-quotas-capacity.md).
321+
| Parameter | Valid values | Description |
322+
|-----------|--------------|-------------|
323+
| "maxFailedItems" | -1, null or 0, positive integer | Continue indexing if errors happen at any point of processing, either while parsing blobs or while adding documents to an index. Set this property to the number of acceptable failures. A value of `-1` allows processing no matter how many errors occur. Otherwise, the value is a positive integer. |
324+
| "maxFailedItemsPerBatch" | -1, null or 0, positive integer | Same as above, but used for batch indexing. |
325+
| "failOnUnsupportedContentType" | true or false | If the indexer is unable to determine the content type, specify whether to continue or fail the job. |
326+
| "failOnUnprocessableDocument" | true or false | If the indexer is unable to process a document of an otherwise supported content type, specify whether to continue or fail the job. |
327+
| "indexStorageMetadataOnlyForOversizedDocuments" | true or false | Oversized blobs are treated as errors by default. If you set this parameter to true, the indexer will try to index the blob's metadata even if the content cannot be indexed. For limits on blob size, see [service Limits](search-limits-quotas-capacity.md). |
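
To make the nesting of these five properties concrete, here is a small Python sketch that assembles the "parameters" section of an indexer definition and checks the failure limits against the valid values in the table. The function names are hypothetical and the validation is only an illustration; the service performs its own validation.

```python
import json


def validate_failure_limit(name, value):
    """Accept None (null), -1, 0, or a positive integer, per the table above."""
    if value is None:
        return value
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value < -1:
        raise ValueError(f"{name} must be null, -1, 0, or a positive integer")
    return value


def build_error_parameters(max_failed_items, max_failed_items_per_batch, *,
                           fail_on_unsupported_content_type,
                           fail_on_unprocessable_document,
                           index_storage_metadata_only_for_oversized_documents):
    """Build the "parameters" fragment of an indexer definition."""
    return {
        "parameters": {
            # The two failure limits sit directly under "parameters".
            "maxFailedItems": validate_failure_limit(
                "maxFailedItems", max_failed_items),
            "maxFailedItemsPerBatch": validate_failure_limit(
                "maxFailedItemsPerBatch", max_failed_items_per_batch),
            # The three boolean switches are nested under "configuration".
            "configuration": {
                "failOnUnsupportedContentType": bool(fail_on_unsupported_content_type),
                "failOnUnprocessableDocument": bool(fail_on_unprocessable_document),
                "indexStorageMetadataOnlyForOversizedDocuments": bool(
                    index_storage_metadata_only_for_oversized_documents),
            },
        }
    }


payload = build_error_parameters(
    10, 10,
    fail_on_unsupported_content_type=False,
    fail_on_unprocessable_document=False,
    index_storage_metadata_only_for_oversized_documents=False,
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be merged into the rest of an indexer definition before sending the request.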
345328

346329
## Next steps
347330

348-
You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers). The following articles apply to indexers that pull content from Azure Storage:
331+
You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers.md). The following articles apply to indexers that pull content from Azure Storage:
349332

350333
+ [Change detection and deletion detection](search-howto-index-changed-deleted-blobs.md)
351334
+ [Index large data sets](search-howto-large-index.md)
