articles/search/search-howto-index-csv-blobs.md
Lines changed: 4 additions & 2 deletions
@@ -12,9 +12,11 @@ ms.topic: conceptual
12
12
ms.date: 02/01/2021
13
13
---
14
14
15
-
# How to index CSV blobs using delimitedText parsing mode and Blob indexers in Azure Cognitive Search
15
+
# How to index CSV blobs and files using delimitedText parsing mode
16
16
17
-
The Azure Cognitive Search [blob indexer](search-howto-indexing-azure-blob-storage.md) provides a `delimitedText` parsing mode for CSV files that treats each line in the CSV as a separate search document. For example, given the following comma-delimited text, `delimitedText` would result in two documents in the search index:
In Azure Cognitive Search, both blob indexers and file indexers support a `delimitedText` parsing mode for CSV files that treats each line in the CSV as a separate search document. For example, given the following comma-delimited text, `delimitedText` would result in two documents in the search index:
This article shows you how to use [Azure Cognitive Search](search-what-is-azure-search.md) to index documents that have been previously encrypted within [Azure Blob Storage](../storage/blobs/storage-blobs-introduction.md) using [Azure Key Vault](../key-vault/general/overview.md). Normally, an indexer cannot extract content from encrypted files because it doesn't have access to the encryption key. However, by leveraging the [DecryptBlobFile](https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Utils/DecryptBlobFile) custom skill, followed by the [DocumentExtractionSkill](cognitive-search-skill-document-extraction.md), you can provide controlled access to the key to decrypt the files and then have content extracted from them. This unlocks the ability to index these documents without compromising the encryption status of your stored documents.
18
20
19
21
Starting with previously encrypted whole documents (unstructured text) such as PDF, HTML, DOCX, and PPTX in Azure Blob Storage, this guide uses Postman and the Search REST APIs to perform the following tasks:
articles/search/search-howto-index-json-blobs.md
Lines changed: 4 additions & 2 deletions
@@ -11,9 +11,11 @@ ms.service: cognitive-search
11
11
ms.topic: conceptual
12
12
ms.date: 02/01/2021
13
13
---
14
-
# How to index JSON blobs using a Blob indexer in Azure Cognitive Search
14
+
# How to index JSON blobs and files in Azure Cognitive Search
15
15
16
-
This article shows you how to [configure a blob indexer](search-howto-indexing-azure-blob-storage.md) for blobs that consist of JSON documents. JSON blobs in Azure Blob Storage commonly assume any of these forms:
This article shows you how to set JSON-specific properties for blobs or files that consist of JSON documents. JSON blobs in Azure Blob Storage or Azure File Storage commonly assume any of these forms:
17
19
18
20
+ A single JSON document
19
21
+ A JSON document containing an array of well-formed JSON elements
articles/search/search-howto-index-one-to-many-blobs.md
Lines changed: 4 additions & 2 deletions
@@ -12,9 +12,11 @@ ms.topic: conceptual
12
12
ms.date: 02/01/2021
13
13
---
14
14
15
-
# Indexing blobs to produce multiple search documents
15
+
# Indexing blobs and files to produce multiple search documents
16
16
17
-
By default, a blob indexer will treat the contents of a blob as a single search document. If you want a more granular representation of the blob in a search index, you can set **parsingMode** values to create multiple search documents from one blob. The **parsingMode** values that result in many search documents include `delimitedText` (for [CSV](search-howto-index-csv-blobs.md)), and `jsonArray` or `jsonLines` (for [JSON](search-howto-index-json-blobs.md)).
By default, an indexer will treat the contents of a blob or file as a single search document. If you want a more granular representation in a search index, you can set **parsingMode** values to create multiple search documents from one blob or file. The **parsingMode** values that result in many search documents include `delimitedText` (for [CSV](search-howto-index-csv-blobs.md)), and `jsonArray` or `jsonLines` (for [JSON](search-howto-index-json-blobs.md)).
18
20
19
21
When you use any of these parsing modes, the new search documents that emerge must have unique document keys, and a problem arises in determining where that value comes from. The parent blob has at least one unique value in the form of the `metadata_storage_path` property, but if it contributes that value to more than one search document, the key is no longer unique in the index.
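The key collision described above can be illustrated with a short Python sketch. It does not reproduce how the indexer generates keys: the blob path, the sample rows, and the `path#position` scheme are all hypothetical, chosen only to show why a single blob-level value cannot serve as the key for several search documents.

```python
# Hypothetical blob path and CSV rows; they stand in for one parent blob
# that a one-to-many parsing mode splits into several search documents.
blob_path = "https://example.blob.core.windows.net/data/report.csv"
rows = ["1, widget, 4.50", "2, gadget, 7.25"]

# Naive approach: every child document reuses the parent blob's path as its key.
naive_keys = [blob_path for _ in rows]
naive_unique = len(set(naive_keys)) == len(naive_keys)  # False: the keys collide

# Illustrative fix (an assumption, not the service's algorithm):
# qualify the path with the row position so each document gets its own key.
qualified_keys = [f"{blob_path}#{position}" for position, _ in enumerate(rows)]
qualified_unique = len(set(qualified_keys)) == len(qualified_keys)  # True

print(naive_unique, qualified_unique)
```

Whatever scheme is used in practice, the point is the same: the key has to vary per generated document, not per blob.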
articles/search/search-howto-index-plaintext-blobs.md
Lines changed: 4 additions & 2 deletions
@@ -11,9 +11,11 @@ ms.topic: conceptual
11
11
ms.date: 02/01/2021
12
12
---
13
13
14
-
# How to index plain text blobs in Azure Cognitive Search
14
+
# How to index plain text blobs and files in Azure Cognitive Search
15
15
16
-
When using a [blob indexer](search-howto-indexing-azure-blob-storage.md) to extract searchable blob text for full text search, you can assign a parsing mode to get better indexing outcomes. By default, the indexer parses blob content as a single chunk of text. However, if all blobs contain plain text in the same encoding, you can significantly improve indexing performance by using the `text` parsing mode.
When using an indexer to extract searchable blob text or file content for full text search, you can assign a parsing mode to get better indexing outcomes. By default, the indexer parses the content as a single chunk of text. However, if all blobs and files contain plain text in the same encoding, you can significantly improve indexing performance by using the `text` parsing mode.
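As a rough sketch of what enabling the `text` parsing mode looks like, the Python below builds an indexer definition with `parsingMode` set under `parameters.configuration`. The indexer, data source, and index names are placeholders, and the explicit `encoding` value is included as an assumption for illustration; check the indexer reference before relying on it.

```python
import json


def text_indexer_definition(name, data_source, index, encoding="utf-8"):
    """Build an indexer definition that uses the `text` parsing mode.

    All names passed in are placeholders supplied by the caller.
    """
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "parameters": {
            "configuration": {
                "parsingMode": "text",
                # Assumed to match the encoding shared by all blobs or files.
                "encoding": encoding,
            }
        },
    }


body = text_indexer_definition(
    "my-plaintext-indexer", "my-blob-datasource", "my-search-index"
)
print(json.dumps(body, indent=2))
```

The printed JSON is the shape of the request body you would send when creating or updating the indexer.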
articles/search/search-howto-indexing-azure-blob-storage.md
Lines changed: 66 additions & 42 deletions
@@ -12,7 +12,7 @@ ms.topic: how-to
12
12
ms.date: 01/17/2022
13
13
---
14
14
15
-
# Configure a Blob indexer to import data from Azure Blob Storage
15
+
# Index data from Azure Blob Storage
16
16
17
17
In Azure Cognitive Search, blob [indexers](search-indexer-overview.md) are frequently used for both [AI enrichment](cognitive-search-concept-intro.md) and text-based processing.
18
18
@@ -26,7 +26,7 @@ This article supplements [**Create an indexer**](search-howto-create-indexers.md
26
26
27
27
+ [Access tiers](../storage/blobs/access-tiers-overview.md) for Blob storage include hot, cool, and archive. Only hot and cool can be accessed by search indexers.
28
28
29
-
+ Blob content cannot exceed the [indexer limits](search-limits-quotas-capacity.md#indexer-limits) for your search service tier.
29
+
+ Blob containers storing non-binary textual content for text-based indexing. This indexer also supports [AI enrichment](cognitive-search-concept-intro.md) if you have binary files. Note that blob content cannot exceed the [indexer limits](search-limits-quotas-capacity.md#indexer-limits) for your search service tier.
30
30
31
31
<a name="SupportedFormats"></a>
32
32
@@ -38,7 +38,7 @@ The Azure Cognitive Search blob indexer can extract text from the following docu
38
38
39
39
## Define the data source
40
40
41
-
A primary difference between a blob indexer and other indexers is the data source assignment. The data source definition specifies the type ("type": `"azureblob"`) and how to connect.
41
+
A primary difference between a blob indexer and other indexers is the data source assignment. The data source definition specifies "type": `"azureblob"`, a content path, and how to connect.
42
42
43
43
1. [Create or update a data source](/rest/api/searchservice/create-data-source) to set its definition:
44
44
@@ -53,15 +53,17 @@ A primary difference between a blob indexer and other indexers is the data sourc
53
53
54
54
1. Set "type" to `"azureblob"` (required).
55
55
56
-
1. Set "credentials" to the connection string, as shown in the above example, or one of the alternative approaches described in the next section.
56
+
1. Set "credentials" to an Azure Storage connection string. The next section describes the supported formats.
57
57
58
-
1. Set "container" to the blob container within Azure Storage. If the container uses folders to organize content, set "query" to specify a subfolder.
58
+
1. Set "container" to the blob container, and use "query" to specify any subfolders.
59
+
60
+
A data source definition can also include additional properties for [soft deletion policies](search-howto-index-changed-deleted-blobs.md) and [field mappings](search-indexer-field-mappings.md) if field names and types are not the same.
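The three settings listed in the steps above ("type", "credentials", and "container") can be assembled as follows. This is a minimal sketch in Python rather than a definitive request: the data source name, container name, folder, and connection string placeholder are hypothetical values, and the helper function itself is illustrative.

```python
import json


def blob_data_source(name, connection_string, container, folder=None):
    """Build a blob data source definition.

    `folder` is optional; when given, it is passed as the container "query"
    so that only that subfolder is read.
    """
    container_def = {"name": container}
    if folder:
        container_def["query"] = folder
    return {
        "name": name,
        "type": "azureblob",  # required for blob indexers
        "credentials": {"connectionString": connection_string},
        "container": container_def,
    }


# Placeholder values; substitute a full connection string, not just a key.
definition = blob_data_source(
    name="my-blob-datasource",
    connection_string="DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;",
    container="my-container",
    folder="my-folder",
)
print(json.dumps(definition, indent=2))
```

Leaving `folder` unset omits "query" entirely, which corresponds to indexing the whole container.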
59
61
60
62
<a name="credentials"></a>
61
63
62
64
### Supported credentials and connection strings
63
65
64
-
You can provide the credentials for the blob container in one of these ways:
66
+
Indexers can connect to a blob container using the following connections.
65
67
66
68
| Managed identity connection string |
67
69
|------------------------------------|
@@ -71,7 +73,7 @@ You can provide the credentials for the blob container in one of these ways:
| You can get the connection string from the Azure portal by navigating to the Storage Account > Settings > Keys (for Classic storage accounts) or Security + networking > Access keys (for Azure Resource Manager storage accounts). |
76
+
| You can get the connection string from the Storage account page in the Azure portal by selecting **Access keys** in the left navigation pane. Make sure to select a full connection string and not just a key. |
@@ -86,24 +88,28 @@ You can provide the credentials for the blob container in one of these ways:
86
88
> [!NOTE]
87
89
> If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer will fail with an error message similar to "Credentials provided in the connection string are invalid or have expired".
88
90
89
-
## Define search fields for blob data
91
+
## Add search fields to an index
90
92
91
-
A [search index](search-what-is-an-index.md) specifies the fields in a search document, attributes, and other constructs that shape the search experience. All indexers require that you specify a search index definition as the destination.
93
+
In a [search index](search-what-is-an-index.md), add fields to accept the content and metadata of your Azure blobs.
92
94
93
95
1. [Create or update an index](/rest/api/searchservice/create-index) to define search fields that will store blob content and metadata:
1. <a name="DocumentKeys"></a> Designate one string field as the document key that uniquely identifies each document. For blob content, the best candidates for a document key are metadata properties on the blob:
112
+
1. Designate one string field as the document key that uniquely identifies each document. For blob content, the best candidates for a document key are metadata properties on the blob:
107
113
108
114
+ **`metadata_storage_path`** (default). Using the full path ensures uniqueness, but the path contains `/` characters that are [invalid in a document key](/rest/api/searchservice/naming-rules). Use the [base64Encode function](search-indexer-field-mappings.md#base64EncodeFunction) to encode characters (see the example in the next section). If using the portal to define the indexer, the encoding step is built in.
109
115
@@ -118,7 +124,41 @@ A [search index](search-what-is-an-index.md) specifies the fields in a search do
118
124
119
125
1. Add more fields for any blob metadata that you want in the index. The indexer can read custom metadata properties, [standard metadata](#indexing-blob-metadata) properties, and [content-specific metadata](search-blob-metadata-properties.md) properties.
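The document-key step above notes that a blob path contains characters that are not allowed in a key and that encoding resolves this. The Python sketch below shows the idea with the standard library's URL-safe Base64. It is a stand-in rather than the service's `base64Encode` function, whose exact output (padding in particular) may differ. The sample path is hypothetical, and the character class used for validation is an assumption based on the naming rules linked above.

```python
import base64
import re

# Assumed set of characters valid in a document key for this sketch:
# letters, digits, underscores, dashes, and equal signs.
VALID_KEY = re.compile(r"^[A-Za-z0-9_\-=]+$")


def url_safe_key(path: str) -> str:
    """Encode a blob path with URL-safe Base64 so it contains no '/' or ':'."""
    return base64.urlsafe_b64encode(path.encode("utf-8")).decode("ascii")


def original_path(key: str) -> str:
    """Reverse the encoding to recover the blob path."""
    return base64.urlsafe_b64decode(key.encode("ascii")).decode("utf-8")


path = "https://example.blob.core.windows.net/docs/report.pdf"  # hypothetical
key = url_safe_key(path)

print(bool(VALID_KEY.match(path)))  # the raw path fails the check
print(bool(VALID_KEY.match(key)))   # the encoded key passes
```

Because the encoding is reversible, the original path can still be recovered from the key when needed.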
120
126
121
-
## Set field mappings
127
+
## Configure the blob indexer
128
+
129
+
Indexer configuration specifies the inputs, parameters, and properties that inform run-time behaviors.
130
+
131
+
Under "configuration", you can control which blobs are indexed and which are skipped, either by the blob's file type or by setting properties on the blobs themselves that cause the indexer to skip over them.
132
+
133
+
1. [Create or update an indexer](/rest/api/searchservice/create-indexer) to use the predefined data source and search index.
134
+
135
+
```http
136
+
POST https://[service name].search.windows.net/indexers?api-version=2020-06-30
137
+
{
138
+
"name" : "my-blob-indexer",
139
+
"dataSourceName" : "my-blob-datasource",
140
+
"targetIndexName" : "my-search-index",
141
+
"parameters": {
142
+
"batchSize": null,
143
+
"maxFailedItems": null,
144
+
"maxFailedItemsPerBatch": null,
145
+
"configuration": {
146
+
"indexedFileNameExtensions" : ".pdf,.docx",
147
+
"excludedFileNameExtensions" : ".png,.jpeg"
148
+
}
149
+
},
150
+
"schedule" : { },
151
+
"fieldMappings" : [ ]
152
+
}
153
+
```
154
+
155
+
1. In the optional "configuration" section, provide any inclusion or exclusion criteria. If left unspecified, all blobs in the container are retrieved.
156
+
157
+
If both `indexedFileNameExtensions` and `excludedFileNameExtensions` parameters are present, Azure Cognitive Search first looks at `indexedFileNameExtensions`, then at `excludedFileNameExtensions`. If the same file extension is present in both lists, it will be excluded from indexing.
158
+
159
+
1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.
160
+
161
+
### Set field mappings
122
162
123
163
Field mappings are a section in the indexer definition that maps source fields to destination fields in the search index.
124
164
@@ -131,10 +171,11 @@ Reasons for [creating an explicit field mapping](search-indexer-field-mappings.m
131
171
The following example demonstrates "metadata_storage_name" as the document key. Assume the index has a key field named "key" and another field named "fileSize" for storing the document size. [Field mappings](search-indexer-field-mappings.md) in the indexer definition establish field associations, and "metadata_storage_name" has the [base64Encode field mapping function](search-indexer-field-mappings.md#base64EncodeFunction) to handle unsupported characters.
132
172
133
173
```http
134
-
PUT /indexers/my-blob-indexer?api-version=2020-06-30
174
+
POST https://[service name].search.windows.net/indexers?api-version=2020-06-30
@@ -162,7 +203,7 @@ PUT /indexers/blob-indexer?api-version=2020-06-30
162
203
163
204
<a name="PartsOfBlobToIndex"></a>
164
205
165
-
## Set parameters
206
+
### Set parameters
166
207
167
208
Blob indexers include parameters that optimize indexing for specific use cases, such as content types (JSON, CSV, PDF), or to specify which parts of the blob to index.
168
209
@@ -225,49 +266,32 @@ Lastly, any metadata properties specific to the document format of the blobs you
225
266
226
267
It's important to point out that you don't need to define fields for all of the above properties in your search index - just capture the properties you need for your application.
227
268
228
-
## How blobs are indexed
229
-
230
-
By default, most blobs are indexed as a single search document in the index, including blobs with structured content, such as JSON or CSV, which are indexed as a single chunk of text. However, for JSON or CSV documents that have an internal structure (delimiters), you can assign parsing modes to generate individual search documents for each line or element. For more information, see [Indexing JSON blobs](search-howto-index-json-blobs.md) and [Indexing CSV blobs](search-howto-index-csv-blobs.md).
231
-
232
-
A compound or embedded document (such as a ZIP archive, a Word document with embedded Outlook email containing attachments, or a .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of an .MSG file will be returned in the normalized_images field within the same search document.
233
-
234
-
<a name="WhichBlobsAreIndexed"></a>
235
-
236
269
## How to control which blobs are indexed
237
270
238
271
You can control which blobs are indexed and which are skipped, either by the blob's file type or by setting properties on the blobs themselves that cause the indexer to skip over them.
239
272
240
-
### Include specific file extensions
241
-
242
-
Use "indexedFileNameExtensions" to provide a comma-separated list of file extensions to index (with a leading dot). For example, to index only the .PDF and .DOCX blobs, do this:
273
+
Include specific file extensions by setting `"indexedFileNameExtensions"` to a comma-separated list of file extensions (with a leading dot). Exclude specific file extensions by setting `"excludedFileNameExtensions"` to the extensions that should be skipped. If the same extension is in both lists, it will be excluded from indexing.
243
274
244
275
```http
245
276
PUT /indexers/[indexer name]?api-version=2020-06-30
246
277
{
247
278
"parameters" : {
248
279
"configuration" : {
249
-
"indexedFileNameExtensions" : ".pdf, .docx"
280
+
"indexedFileNameExtensions" : ".pdf, .docx",
281
+
"excludedFileNameExtensions" : ".png, .jpeg"
250
282
}
251
283
}
252
284
}
253
285
```
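The inclusion and exclusion rule described above (check the include list first, then the exclude list, with exclusion winning on conflicts) can be sketched in Python. This models the documented behavior and is not the indexer's code: the function names are hypothetical, and the case-insensitive comparison and whitespace trimming are assumptions of the sketch.

```python
import os


def parse_extensions(value):
    """Split a comma-separated list such as ".pdf, .docx" into a set."""
    if not value:
        return set()
    return {ext.strip().lower() for ext in value.split(",") if ext.strip()}


def should_index(blob_name, indexed=None, excluded=None):
    """Return True if a blob with this name would be picked up for indexing."""
    extension = os.path.splitext(blob_name)[1].lower()
    include = parse_extensions(indexed)
    exclude = parse_extensions(excluded)

    # Step 1: if an inclusion list is given, the extension must appear in it.
    if include and extension not in include:
        return False
    # Step 2: the exclusion list is checked afterwards, so it wins on conflicts.
    return extension not in exclude


# Same lists as the request above.
for name in ("report.pdf", "photo.png", "notes.txt"):
    print(name, should_index(name, ".pdf, .docx", ".png, .jpeg"))
```

With no lists supplied, `should_index` returns `True` for every blob, which matches the default of retrieving all blobs in the container.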
254
286
255
-
### Exclude specific file extensions
287
+
## How blobs are indexed
256
288
257
-
Use "excludedFileNameExtensions" to provide a comma-separated list of file extensions to skip (again, with a leading dot). For example, to index all blobs except those with the .PNG and .JPEG extensions, do this:
289
+
By default, most blobs are indexed as a single search document in the index, including blobs with structured content, such as JSON or CSV, which are indexed as a single chunk of text. However, for JSON or CSV documents that have an internal structure (delimiters), you can assign parsing modes to generate individual search documents for each line or element:
258
290
259
-
```http
260
-
PUT /indexers/[indexer name]?api-version=2020-06-30
If both "indexedFileNameExtensions" and "excludedFileNameExtensions" parameters are present, the indexer first looks at "indexedFileNameExtensions", then at "excludedFileNameExtensions". If the same file extension is in both lists, it will be excluded from indexing.
294
+
A compound or embedded document (such as a ZIP archive, a Word document with embedded Outlook email containing attachments, or a .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of an .MSG file will be returned in the normalized_images field within the same search document.