Commit 9b2e15c

Merge pull request #185211 from HeidiSteen/heidist-fresh
[azure search] Updates to Azure Files indexer
2 parents 266edae + f320748 commit 9b2e15c

8 files changed: +214 -177 lines changed

articles/search/search-file-storage-integration.md

Lines changed: 92 additions & 87 deletions
Large diffs are not rendered by default.

articles/search/search-howto-index-csv-blobs.md

Lines changed: 4 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -12,9 +12,11 @@ ms.topic: conceptual
1212
ms.date: 02/01/2021
1313
---
1414

15-
# How to index CSV blobs using delimitedText parsing mode and Blob indexers in Azure Cognitive Search
15+
# How to index CSV blobs and files using delimitedText parsing mode
1616

17-
The Azure Cognitive Search [blob indexer](search-howto-indexing-azure-blob-storage.md) provides a `delimitedText` parsing mode for CSV files that treats each line in the CSV as a separate search document. For example, given the following comma-delimited text, `delimitedText` would result in two documents in the search index:
17+
**Applies to**: [Blob indexers](search-howto-indexing-azure-blob-storage.md), [File indexers](search-file-storage-integration.md)
18+
19+
In Azure Cognitive Search, both blob indexers and file indexers support a `delimitedText` parsing mode for CSV files that treats each line in the CSV as a separate search document. For example, given the following comma-delimited text, `delimitedText` would result in two documents in the search index:
1820

1921
```text
2022
id, datePublished, tags

articles/search/search-howto-index-encrypted-blobs.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@ ms.date: 11/19/2021
1414

1515
# How to index encrypted blobs using blob indexers and skillsets in Azure Cognitive Search
1616

17+
**Applies to**: [Blob indexers](search-howto-indexing-azure-blob-storage.md), [File indexers](search-file-storage-integration.md)
18+
1719
This article shows you how to use [Azure Cognitive Search](search-what-is-azure-search.md) to index documents that have been previously encrypted within [Azure Blob Storage](../storage/blobs/storage-blobs-introduction.md) using [Azure Key Vault](../key-vault/general/overview.md). Normally, an indexer cannot extract content from encrypted files because it doesn't have access to the encryption key. However, by leveraging the [DecryptBlobFile](https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Utils/DecryptBlobFile) custom skill, followed by the [DocumentExtractionSkill](cognitive-search-skill-document-extraction.md), you can provide controlled access to the key to decrypt the files and then have content extracted from them. This unlocks the ability to index these documents without compromising the encryption status of your stored documents.
1820

1921
Starting with previously encrypted whole documents (unstructured text) such as PDF, HTML, DOCX, and PPTX in Azure Blob Storage, this guide uses Postman and the Search REST APIs to perform the following tasks:

articles/search/search-howto-index-json-blobs.md

Lines changed: 4 additions & 2 deletions
@@ -11,9 +11,11 @@ ms.service: cognitive-search
1111
ms.topic: conceptual
1212
ms.date: 02/01/2021
1313
---
14-
# How to index JSON blobs using a Blob indexer in Azure Cognitive Search
14+
# How to index JSON blobs and files in Azure Cognitive Search
1515

16-
This article shows you how to [configure a blob indexer](search-howto-indexing-azure-blob-storage.md) for blobs that consist of JSON documents. JSON blobs in Azure Blob Storage commonly assume any of these forms:
16+
**Applies to**: [Blob indexers](search-howto-indexing-azure-blob-storage.md), [File indexers](search-file-storage-integration.md)
17+
18+
This article shows you how to set JSON-specific properties for blobs or files that consist of JSON documents. JSON blobs in Azure Blob Storage or Azure File Storage commonly assume any of these forms:
1719

1820
+ A single JSON document
1921
+ A JSON document containing an array of well-formed JSON elements

articles/search/search-howto-index-one-to-many-blobs.md

Lines changed: 4 additions & 2 deletions
@@ -12,9 +12,11 @@ ms.topic: conceptual
1212
ms.date: 02/01/2021
1313
---
1414

15-
# Indexing blobs to produce multiple search documents
15+
# Indexing blobs and files to produce multiple search documents
1616

17-
By default, a blob indexer will treat the contents of a blob as a single search document. If you want a more granular representation of the blob in a search index, you can set **parsingMode** values to create multiple search documents from one blob. The **parsingMode** values that result in many search documents include `delimitedText` (for [CSV](search-howto-index-csv-blobs.md)), and `jsonArray` or `jsonLines` (for [JSON](search-howto-index-json-blobs.md)).
17+
**Applies to**: [Blob indexers](search-howto-indexing-azure-blob-storage.md), [File indexers](search-file-storage-integration.md)
18+
19+
By default, an indexer will treat the contents of a blob or file as a single search document. If you want a more granular representation in a search index, you can set **parsingMode** values to create multiple search documents from one blob or file. The **parsingMode** values that result in many search documents include `delimitedText` (for [CSV](search-howto-index-csv-blobs.md)), and `jsonArray` or `jsonLines` (for [JSON](search-howto-index-json-blobs.md)).
1820

1921
When you use any of these parsing modes, the new search documents that emerge must have unique document keys, and a problem arises in determining where that value comes from. The parent blob has at least one unique value in the form of the `metadata_storage_path` property, but if it contributes that value to more than one search document, the key is no longer unique in the index.
2022

articles/search/search-howto-index-plaintext-blobs.md

Lines changed: 4 additions & 2 deletions
@@ -11,9 +11,11 @@ ms.topic: conceptual
1111
ms.date: 02/01/2021
1212
---
1313

14-
# How to index plain text blobs in Azure Cognitive Search
14+
# How to index plain text blobs and files in Azure Cognitive Search
1515

16-
When using a [blob indexer](search-howto-indexing-azure-blob-storage.md) to extract searchable blob text for full text search, you can assign a parsing mode to get better indexing outcomes. By default, the indexer parses blob content as a single chunk of text. However, if all blobs contain plain text in the same encoding, you can significantly improve indexing performance by using the `text` parsing mode.
16+
**Applies to**: [Blob indexers](search-howto-indexing-azure-blob-storage.md), [File indexers](search-file-storage-integration.md)
17+
18+
When using an indexer to extract searchable blob text or file content for full text search, you can assign a parsing mode to get better indexing outcomes. By default, the indexer parses the content as a single chunk of text. However, if all blobs and files contain plain text in the same encoding, you can significantly improve indexing performance by using the `text` parsing mode.
1719

1820
Recommendations for using `text` parsing include:
1921

articles/search/search-howto-indexing-azure-blob-storage.md

Lines changed: 66 additions & 42 deletions
@@ -12,7 +12,7 @@ ms.topic: how-to
1212
ms.date: 01/17/2022
1313
---
1414

15-
# Configure a Blob indexer to import data from Azure Blob Storage
15+
# Index data from Azure Blob Storage
1616

1717
In Azure Cognitive Search, blob [indexers](search-indexer-overview.md) are frequently used for both [AI enrichment](cognitive-search-concept-intro.md) and text-based processing.
1818

@@ -26,7 +26,7 @@ This article supplements [**Create an indexer**](search-howto-create-indexers.md
2626

2727
+ [Access tiers](../storage/blobs/access-tiers-overview.md) for Blob storage include hot, cool, and archive. Only hot and cool can be accessed by search indexers.
2828

29-
+ Blob content cannot exceed the [indexer limits](search-limits-quotas-capacity.md#indexer-limits) for your search service tier.
29+
+ Blob containers storing non-binary textual content for text-based indexing. This indexer also supports [AI enrichment](cognitive-search-concept-intro.md) if you have binary files. Note that blob content cannot exceed the [indexer limits](search-limits-quotas-capacity.md#indexer-limits) for your search service tier.
3030

3131
<a name="SupportedFormats"></a>
3232

@@ -38,7 +38,7 @@ The Azure Cognitive Search blob indexer can extract text from the following docu
3838

3939
## Define the data source
4040

41-
A primary difference between a blob indexer and other indexers is the data source assignment. The data source definition specifies the type ("type": `"azureblob"`) and how to connect.
41+
A primary difference between a blob indexer and other indexers is the data source assignment. The data source definition specifies "type": `"azureblob"`, a content path, and how to connect.
4242

4343
1. [Create or update a data source](/rest/api/searchservice/create-data-source) to set its definition:
4444

@@ -53,15 +53,17 @@ A primary difference between a blob indexer and other indexers is the data sourc
5353

5454
1. Set "type" to `"azureblob"` (required).
5555

56-
1. Set "credentials" to the connection string, as shown in the above example, or one of the alternative approaches described in the next section.
56+
1. Set "credentials" to an Azure Storage connection string. The next section describes the supported formats.
5757

58-
1. Set "container" to the blob container within Azure Storage. If the container uses folders to organize content, set "query" to specify a subfolder.
58+
1. Set "container" to the blob container, and use "query" to specify any subfolders.
59+
60+
A data source definition can also include additional properties for [soft deletion policies](search-howto-index-changed-deleted-blobs.md) and [field mappings](search-indexer-field-mappings.md) if field names and types are not the same.
5961

6062
<a name="credentials"></a>
6163

6264
### Supported credentials and connection strings
6365

64-
You can provide the credentials for the blob container in one of these ways:
66+
Indexers can connect to a blob container using the following connections.
6567

6668
| Managed identity connection string |
6769
|------------------------------------|
@@ -71,7 +73,7 @@ You can provide the credentials for the blob container in one of these ways:
7173
| Full access storage account connection string |
7274
|-----------------------------------------------|
7375
|`{ "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<your storage account>;AccountKey=<your account key>;" }` |
74-
| You can get the connection string from the Azure portal by navigating to the Storage Account > Settings > Keys (for Classic storage accounts) or Security + networking > Access keys (for Azure Resource Manager storage accounts). |
76+
| You can get the connection string from the Storage account page in Azure portal by selecting **Access keys** in the left navigation pane. Make sure to select a full connection string and not just a key. |
7577

7678
| Storage account shared access signature (SAS) connection string |
7779
|-------------------------------------------------------------------|
@@ -86,24 +88,28 @@ You can provide the credentials for the blob container in one of these ways:
8688
> [!NOTE]
8789
> If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer will fail with an error message similar to "Credentials provided in the connection string are invalid or have expired".
8890

89-
## Define search fields for blob data
91+
## Add search fields to an index
9092

91-
A [search index](search-what-is-an-index.md) specifies the fields in a search document, attributes, and other constructs that shape the search experience. All indexers require that you specify a search index definition as the destination.
93+
In a [search index](search-what-is-an-index.md), add fields to accept the content and metadata of your Azure blobs.
9294

9395
1. [Create or update an index](/rest/api/searchservice/create-index) to define search fields that will store blob content and metadata:
9496

9597
```http
96-
PUT /indexes?api-version=2020-06-30
98+
POST /indexes?api-version=2020-06-30
9799
{
98-
"name" : "my-target-index",
100+
"name" : "my-search-index",
99101
"fields": [
100102
{ "name": "metadata_storage_path", "type": "Edm.String", "key": true, "searchable": false },
101-
{ "name": "content", "type": "Edm.String", "searchable": true, "filterable": false }
103+
{ "name": "content", "type": "Edm.String", "searchable": true, "filterable": false },
104+
{ "name": "metadata_storage_name", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true },
105+
{ "name": "metadata_storage_path", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true },
106+
{ "name": "metadata_storage_size", "type": "Edm.Int64", "searchable": false, "filterable": true, "sortable": true },
107+
{ "name": "metadata_storage_content_type", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true }
102108
]
103109
}
104110
```
105111

106-
1. <a name="DocumentKeys"></a> Designate one string field as the document key that uniquely identifies each document. For blob content, the best candidates for a document key are metadata properties on the blob:
112+
1. Designate one string field as the document key that uniquely identifies each document. For blob content, the best candidates for a document key are metadata properties on the blob:
107113

108114
+ **`metadata_storage_path`** (default). Using the full path ensures uniqueness, but the path contains `/` characters that are [invalid in a document key](/rest/api/searchservice/naming-rules). Use the [base64Encode function](search-indexer-field-mappings.md#base64EncodeFunction) to encode characters (see the example in the next section). If using the portal to define the indexer, the encoding step is built in.
109115

@@ -118,7 +124,41 @@ A [search index](search-what-is-an-index.md) specifies the fields in a search do
118124

119125
1. Add more fields for any blob metadata that you want in the index. The indexer can read custom metadata properties, [standard metadata](#indexing-blob-metadata) properties, and [content-specific metadata](search-blob-metadata-properties.md) properties.
120126

121-
## Set field mappings
127+
## Configure the blob indexer
128+
129+
Indexer configuration specifies the inputs, parameters, and properties that inform run-time behaviors.
130+
131+
Under "configuration", you can control which blobs are indexed, and which are skipped, by the blob's file type or by setting properties on the blobs themselves that cause the indexer to skip over them.
132+
133+
1. [Create or update an indexer](/rest/api/searchservice/create-indexer) to use the predefined data source and search index.
134+
135+
```http
136+
POST https://[service name].search.windows.net/indexers?api-version=2020-06-30
137+
{
137+
"name" : "my-blob-indexer",
139+
"dataSourceName" : "my-blob-datasource",
140+
"targetIndexName" : "my-search-index",
141+
"parameters": {
142+
"batchSize": null,
143+
"maxFailedItems": null,
144+
"maxFailedItemsPerBatch": null,
145+
"configuration:" {
146+
"indexedFileNameExtensions" : ".pdf,.docx",
147+
"excludedFileNameExtensions" : ".png,.jpeg"
148+
}
149+
},
150+
"schedule" : { },
151+
"fieldMappings" : [ ]
152+
}
153+
```
154+
155+
1. In the optional "configuration" section, provide any inclusion or exclusion criteria. If left unspecified, all blobs in the container are retrieved.
156+
157+
If both `indexedFileNameExtensions` and `excludedFileNameExtensions` parameters are present, Azure Cognitive Search first looks at `indexedFileNameExtensions`, then at `excludedFileNameExtensions`. If the same file extension is present in both lists, it will be excluded from indexing.
158+
159+
1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.
160+
161+
### Set field mappings
122162

123163
Field mappings are a section in the indexer definition that maps source fields to destination fields in the search index.
124164

@@ -131,10 +171,11 @@ Reasons for [creating an explicit field mapping](search-indexer-field-mappings.m
131171
The following example demonstrates "metadata_storage_name" as the document key. Assume the index has a key field named "key" and another field named "fileSize" for storing the document size. [Field mappings](search-indexer-field-mappings.md) in the indexer definition establish field associations, and "metadata_storage_name" has the [base64Encode field mapping function](search-indexer-field-mappings.md#base64EncodeFunction) to handle unsupported characters.
132172

133173
```http
134-
PUT /indexers/my-blob-indexer?api-version=2020-06-30
174+
POST https://[service name].search.windows.net/indexers?api-version=2020-06-30
135175
{
176+
"name" : "my-blob-indexer",
136177
"dataSourceName" : "my-blob-datasource",
137-
"targetIndexName" : "my-target-index",
178+
"targetIndexName" : "my-search-index",
138179
"schedule" : { "interval" : "PT2H" },
139180
"fieldMappings" : [
140181
{ "sourceFieldName" : "metadata_storage_name", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
@@ -162,7 +203,7 @@ PUT /indexers/blob-indexer?api-version=2020-06-30
162203

163204
<a name="PartsOfBlobToIndex"></a>
164205

165-
## Set parameters
206+
### Set parameters
166207

167208
Blob indexers include parameters that optimize indexing for specific use cases, such as content types (JSON, CSV, PDF), or to specify which parts of the blob to index.
168209

@@ -225,49 +266,32 @@ Lastly, any metadata properties specific to the document format of the blobs you
225266

226267
It's important to point out that you don't need to define fields for all of the above properties in your search index - just capture the properties you need for your application.
227268

228-
## How blobs are indexed
229-
230-
By default, most blobs are indexed as a single search document in the index, including blobs with structured content, such as JSON or CSV, which are indexed as a single chunk of text. However, for JSON or CSV documents that have an internal structure (delimiters), you can assign parsing modes to generate individual search documents for each line or element. For more information, see [Indexing JSON blobs](search-howto-index-json-blobs.md) and [Indexing CSV blobs](search-howto-index-csv-blobs.md).
231-
232-
A compound or embedded document (such as a ZIP archive, a Word document with embedded Outlook email containing attachments, or a .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of an .MSG file will be returned in the normalized_images field within the same search document.
233-
234-
<a name="WhichBlobsAreIndexed"></a>
235-
236269
## How to control which blobs are indexed
237270

238271
You can control which blobs are indexed, and which are skipped, by the blob's file type or by setting properties on the blobs themselves that cause the indexer to skip over them.
239272

240-
### Include specific file extensions
241-
242-
Use "indexedFileNameExtensions" to provide a comma-separated list of file extensions to index (with a leading dot). For example, to index only the .PDF and .DOCX blobs, do this:
273+
Include specific file extensions by setting `"indexedFileNameExtensions"` to a comma-separated list of file extensions (with a leading dot). Exclude specific file extensions by setting `"excludedFileNameExtensions"` to the extensions that should be skipped. If the same extension is in both lists, it will be excluded from indexing.
243274

244275
```http
245276
PUT /indexers/[indexer name]?api-version=2020-06-30
246277
{
247278
"parameters" : {
248279
"configuration" : {
249-
"indexedFileNameExtensions" : ".pdf, .docx"
280+
"indexedFileNameExtensions" : ".pdf, .docx",
281+
"excludedFileNameExtensions" : ".png, .jpeg"
250282
}
251283
}
252284
}
253285
```
254286

255-
### Exclude specific file extensions
287+
## How blobs are indexed
256288

257-
Use "excludedFileNameExtensions" to provide a comma-separated list of file extensions to skip (again, with a leading dot). For example, to index all blobs except those with the .PNG and .JPEG extensions, do this:
289+
By default, most blobs are indexed as a single search document in the index, including blobs with structured content, such as JSON or CSV, which are indexed as a single chunk of text. However, for JSON or CSV documents that have an internal structure (delimiters), you can assign parsing modes to generate individual search documents for each line or element:
258290

259-
```http
260-
PUT /indexers/[indexer name]?api-version=2020-06-30
261-
{
262-
"parameters" : {
263-
"configuration" : {
264-
"excludedFileNameExtensions" : ".png, .jpeg"
265-
}
266-
}
267-
}
268-
```
291+
+ [Indexing JSON blobs](search-howto-index-json-blobs.md)
292+
+ [Indexing CSV blobs](search-howto-index-csv-blobs.md)
269293

270-
If both "indexedFileNameExtensions" and "excludedFileNameExtensions" parameters are present, the indexer first looks at "indexedFileNameExtensions", then at "excludedFileNameExtensions". If the same file extension is in both lists, it will be excluded from indexing.
294+
A compound or embedded document (such as a ZIP archive, a Word document with embedded Outlook email containing attachments, or a .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of an .MSG file will be returned in the normalized_images field within the same search document.
271295

272296
### Add "skip" metadata to the blob
273297
