articles/search/search-file-storage-integration.md (13 additions & 46 deletions)
@@ -106,9 +106,15 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont
  }
  ```

- 1. Create a document key field ("key": true), but allow the indexer to populate it automatically. Do not define a field mapping to alternative unique string field.
+ 1. Create a document key field ("key": true). For blob content, the best candidates are metadata properties. Metadata properties often include characters, such as `/` and `-`, that are invalid for document keys. Because the indexer has a "base64EncodeKeys" property (true by default), it automatically encodes the metadata property, with no configuration or field mapping required.

- 1. Add a "content" field to store extracted text from each file.
+    **`metadata_storage_path`** (default): the full path to the object or file
+
+    **`metadata_storage_name`**: usable only if names are unique
+
+    A custom metadata property that you add to blobs. This option requires that your blob upload process adds that metadata property to all blobs. Because the key is a required property, any blobs that are missing a value will fail to be indexed. If you use a custom metadata property as a key, avoid making changes to that property. Indexers will add duplicate documents for the same blob if the key property changes.
+
+ 1. Add a "content" field to store extracted text from each file through the blob's "content" property. You aren't required to use this name, but doing so lets you take advantage of implicit field mappings.

  1. Add fields for standard metadata properties. In file indexing, the standard metadata properties are the same as blob metadata properties. The file indexer automatically creates internal field mappings for these properties that convert hyphenated property names to underscored property names. You still have to add the fields you want to use to the index definition, but you can omit creating field mappings in the data source.
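As a rough sketch of the fields these steps call for, the relevant portion of an index definition might look like the following. The index name, the `ID` key field name, and the attribute settings are illustrative placeholders, not part of this change; `content` and the `metadata_storage_*` properties are the names discussed above.

```json
{
  "name": "my-file-index",
  "fields": [
    { "name": "ID", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "content", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "metadata_storage_name", "type": "Edm.String", "searchable": false, "filterable": true },
    { "name": "metadata_storage_path", "type": "Edm.String", "searchable": false, "retrievable": true },
    { "name": "metadata_storage_size", "type": "Edm.Int64", "searchable": false, "retrievable": true }
  ]
}
```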
@@ -136,6 +142,7 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
      "batchSize": null,
      "maxFailedItems": null,
      "maxFailedItemsPerBatch": null,
+     "base64EncodeKeys": null,
      "configuration": {
          "indexedFileNameExtensions" : ".pdf,.docx",
          "excludedFileNameExtensions" : ".png,.jpeg"
@@ -152,49 +159,9 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
  1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.

- ## Change and deletion detection
-
- After an initial search index is created, you might want subsequent indexer jobs to pick up only new and changed documents. Fortunately, content in Azure Storage is timestamped, which gives indexers sufficient information for determining what's new and changed automatically. For search content that originates from Azure File Storage, the indexer keeps track of the file's `LastModified` timestamp and reindexes only new and changed files.
-
- Although change detection is a given, deletion detection is not. If you want to detect deleted files, make sure to use a "soft delete" approach. If you delete the files outright in a file share, corresponding search documents will not be removed from the search index.
-
- ## Soft delete using custom metadata
-
- This method uses a file's metadata to determine whether a search document should be removed from the index. It requires two separate actions: deleting the search document from the index, followed by file deletion in Azure Storage.
-
- There are steps to follow in both File storage and Cognitive Search, but there are no other feature dependencies.
-
- 1. Add a custom metadata key-value pair to the file in Azure Storage to indicate to Azure Cognitive Search that it is logically deleted.
-
- 1. Configure a soft deletion column detection policy on the data source. For example, the following policy considers a file to be deleted if it has a metadata property `IsDeleted` with the value `true`:
-
-    ```http
-    PUT https://[service name].search.windows.net/datasources/file-datasource?api-version=2020-06-30
-    ```
-
- After an indexer processes a deleted file and removes the corresponding search document from the index, it won't revisit that file if you restore it later and the file's `LastModified` timestamp is older than the last indexer run.
-
- If you would like to reindex that document, change the `"softDeleteMarkerValue" : "false"` for that file and rerun the indexer.
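As a sketch of the data source policy that step describes — assuming the preview `azurefile` data source type, a placeholder connection string and share name, and the `IsDeleted`/`true` convention from the example — the full request might look like this:

```http
PUT https://[service name].search.windows.net/datasources/file-datasource?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]

{
    "name" : "file-datasource",
    "type" : "azurefile",
    "credentials" : { "connectionString" : "<connection string>" },
    "container" : { "name" : "my-file-share", "query" : null },
    "dataDeletionDetectionPolicy" : {
        "@odata.type" : "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName" : "IsDeleted",
        "softDeleteMarkerValue" : "true"
    }
}
```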
+ ## Next steps
- ## See also
+
+ You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers.md). The following articles apply to indexers that pull content from Azure Storage:
+
+ [Indexers in Azure Cognitive Search](search-indexer-overview.md)
+ [What is Azure Files?](../storage/files/storage-files-introduction.md)
+ [Change detection and deletion detection](search-howto-index-changed-deleted-blobs.md)
+ [Index large data sets](search-howto-large-index.md)
@@ -119,10 +119,15 @@ In a [search index](search-what-is-an-index.md), add fields to accept the conten
  }
  ```

- 1. Designate one string field as the document key that uniquely identifies each document. For blob content, the best candidates for a document key are metadata properties on the blob.
+ 1. Create a document key field ("key": true). For blob content, the best candidates are metadata properties. Metadata properties often include characters, such as `/` and `-`, that are invalid for document keys. Because the indexer has a "base64EncodeKeys" property (true by default), it automatically encodes the metadata property, with no configuration or field mapping required.

- 1. Add a "content" field to store extracted text from each file.
- <!-- 1. A **`content`** field is common to blob content. It contains the text extracted from blobs. Your definition of this field might look similar to the one above. You aren't required to use this name, but doing lets you take advantage of implicit field mappings. The blob indexer can send blob contents to a content Edm.String field in the index, with no field mappings required. -->
+    **`metadata_storage_path`** (default): the full path to the object or file
+
+    **`metadata_storage_name`**: usable only if names are unique
+
+    A custom metadata property that you add to blobs. This option requires that your blob upload process adds that metadata property to all blobs. Because the key is a required property, any blobs that are missing a value will fail to be indexed. If you use a custom metadata property as a key, avoid making changes to that property. Indexers will add duplicate documents for the same blob if the key property changes.
+
+ 1. Add a "content" field to store extracted text from each file through the blob's "content" property. You aren't required to use this name, but doing so lets you take advantage of implicit field mappings.

  1. Add fields for standard metadata properties. The indexer can read custom metadata properties, [standard metadata](#indexing-blob-metadata) properties, and [content-specific metadata](search-blob-metadata-properties.md) properties.
@@ -142,6 +147,7 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
      "batchSize": null,
      "maxFailedItems": null,
      "maxFailedItemsPerBatch": null,
+     "base64EncodeKeys": null,
      "configuration": {
          "indexedFileNameExtensions" : ".pdf,.docx",
          "excludedFileNameExtensions" : ".png,.jpeg"
@@ -153,62 +159,6 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
  ```

  1. In the optional "configuration" section, provide any inclusion or exclusion criteria. If left unspecified, all blobs in the container are retrieved.

- 1.
-    ```http
-    POST https://[service name].search.windows.net/indexers?api-version=2020-06-30
-    Content-Type: application/json
-    api-key: [admin key]
-
-    {
-        "name" : "adlsgen2-indexer",
-        "dataSourceName" : "adlsgen2-datasource",
-        "targetIndexName" : "my-target-index",
-        "schedule" : {
-            "interval" : "PT2H"
-        }
-    }
-    ```
-
- This indexer runs immediately, and then [on a schedule](search-howto-schedule-indexers.md) every two hours (schedule interval is set to "PT2H"). To run an indexer every 30 minutes, set the interval to "PT30M". The shortest supported interval is 5 minutes. The schedule is optional - if omitted, an indexer runs only once when it's created. However, you can run an indexer on-demand at any time.
-
- <a name="DocumentKeys"></a>
-
- ## Defining document keys and field mappings
-
- In a search index, the document key uniquely identifies each document. The field you choose must be of type `Edm.String`. For blob content, the best candidates for a document key are metadata properties on the blob.
-
- + **`metadata_storage_name`** - this property is a candidate, but only if names are unique across all containers and folders you are indexing. Regardless of blob location, the end result is that the document key (name) must be unique in the search index after all content has been indexed.
-
-   Another potential issue about the storage name is that it might contain characters that are invalid for document keys, such as dashes. You can handle invalid characters by using the `base64Encode` [field mapping function](search-indexer-field-mappings.md#base64EncodeFunction). If you do this, remember to also encode document keys when passing them in API calls such as [Lookup Document (REST)](/rest/api/searchservice/lookup-document). In .NET, you can use the [UrlTokenEncode method](/dotnet/api/system.web.httpserverutility.urltokenencode) to encode characters.
-
- + **`metadata_storage_path`** - using the full path ensures uniqueness, but the path definitely contains `/` characters that are [invalid in a document key](/rest/api/searchservice/naming-rules). As above, you can use the `base64Encode` [function](search-indexer-field-mappings.md#base64EncodeFunction) to encode characters.
-
- + A third option is to add a custom metadata property to the blobs. This option requires that your blob upload process adds that metadata property to all blobs. Since the key is a required property, any blobs that are missing a value will fail to be indexed.
-
- > [!IMPORTANT]
- > If there is no explicit mapping for the key field in the index, Azure Cognitive Search automatically uses `metadata_storage_path` as the key and base-64 encodes key values (the second option above).
- >
- > If you use a custom metadata property as a key, avoid making changes to that property. Indexers will add duplicate documents for the same blob if the key property changes.
-
- ### Example
-
- The following example demonstrates `metadata_storage_name` as the document key. Assume the index has a key field named `key` and another field named `fileSize` for storing the document size. [Field mappings](search-indexer-field-mappings.md) in the indexer definition establish field associations, and `metadata_storage_name` has the [`base64Encode` field mapping function](search-indexer-field-mappings.md#base64EncodeFunction) to handle unsupported characters.
-
-    ```http
-    PUT https://[service name].search.windows.net/indexers/adlsgen2-indexer?api-version=2020-06-30
-    ```
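A minimal sketch of the field mappings that example describes, assuming the `key` and `fileSize` field names mentioned in the paragraph above (mapping `metadata_storage_size` to `fileSize` is an assumption made for illustration):

```json
"fieldMappings" : [
  {
    "sourceFieldName" : "metadata_storage_name",
    "targetFieldName" : "key",
    "mappingFunction" : { "name" : "base64Encode" }
  },
  {
    "sourceFieldName" : "metadata_storage_size",
    "targetFieldName" : "fileSize"
  }
]
```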
@@ -393,10 +343,10 @@ You can also set [blob configuration properties](/rest/api/searchservice/create-
+ `"indexStorageMetadataOnlyForOversizedDocuments"` to index storage metadata for blob content that is too large to process. Oversized blobs are treated as errors by default. For limits on blob size, see [service limits](search-limits-quotas-capacity.md).
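Like the other blob configuration properties, this one goes under the indexer's "configuration" object. A minimal sketch, with the value shown for illustration only:

```json
"parameters" : {
  "configuration" : {
    "indexStorageMetadataOnlyForOversizedDocuments" : true
  }
}
```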
- ## See also
+ ## Next steps
+
+ You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers.md). The following articles apply to indexers that pull content from Azure Storage:
+
+ [Change detection and deletion detection](search-howto-index-changed-deleted-blobs.md)
+ [Index large data sets](search-howto-large-index.md)
+ [C# Sample: Index Data Lake Gen2 using Azure AD](https://github.com/Azure-Samples/azure-search-dotnet-samples/blob/master/data-lake-gen2-acl-indexing/README.md)
+ [Indexers in Azure Cognitive Search](search-indexer-overview.md)
+ [Create an indexer](search-howto-create-indexers.md)