Commit 436c341 ("checkpoint")

1 parent 32b4f95 commit 436c341

6 files changed: +164 additions, -249 deletions

articles/search/search-file-storage-integration.md

Lines changed: 13 additions & 46 deletions
@@ -106,9 +106,15 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont
     }
     ```

-1. Create a document key field ("key": true), but allow the indexer to populate it automatically. Do not define a field mapping to an alternative unique string field.
+1. Create a document key field ("key": true). For blob content, the best candidates are metadata properties. Metadata properties often include characters, such as `/` and `-`, that are invalid for document keys. Because the indexer has a "base64EncodeKeys" property (true by default), it automatically encodes the metadata property, with no configuration or field mapping required.

-1. Add a "content" field to store extracted text from each file.
+   + **`metadata_storage_path`** (default): the full path to the object or file
+
+   + **`metadata_storage_name`**: usable only if names are unique
+
+   + A custom metadata property that you add to blobs. This option requires that your blob upload process adds that metadata property to all blobs. Because the key is a required property, any blobs that are missing a value fail to be indexed. If you use a custom metadata property as a key, avoid making changes to that property. Indexers add duplicate documents for the same blob if the key property changes.
+
+1. Add a "content" field to store the text extracted from each file through the blob's "content" property. You aren't required to use this name, but doing so lets you take advantage of implicit field mappings.

 1. Add fields for standard metadata properties. In file indexing, the standard metadata properties are the same as blob metadata properties. The file indexer automatically creates internal field mappings for these properties that convert hyphenated property names to underscored property names. You still have to add the fields you want to use to the index definition, but you can omit creating field mappings in the data source.

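To illustrate what automatic key encoding accomplishes, here is a sketch using URL-safe base64. This is an approximation for intuition only; the service's exact encoding scheme may differ, and the storage URL below is hypothetical.

```python
import base64

def encode_key(value: str) -> str:
    """URL-safe base64 without padding: an approximation of how an
    indexer can turn a metadata value (e.g. metadata_storage_path)
    into a string safe to use as a document key."""
    return base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii").rstrip("=")

def decode_key(key: str) -> str:
    """Reverse the encoding, restoring any stripped '=' padding."""
    padded = key + "=" * (-len(key) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")

# Hypothetical account and share names, for illustration only.
key = encode_key("https://myaccount.file.core.windows.net/my-share/docs/report.pdf")
assert "/" not in key                           # path separators are gone
assert decode_key(key).endswith("report.pdf")   # round-trips losslessly
```

Note that the encoding is reversible, so the original path can always be recovered from the stored key.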
@@ -136,6 +142,7 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
   "batchSize": null,
   "maxFailedItems": null,
   "maxFailedItemsPerBatch": null,
+  "base64EncodeKeys": null,
   "configuration": {
     "indexedFileNameExtensions" : ".pdf,.docx",
     "excludedFileNameExtensions" : ".png,.jpeg"
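The two extension settings above amount to a suffix filter on blob names. A rough sketch of the selection rule follows; this is my approximation (including the assumption that exclusion takes precedence), not the indexer's actual code:

```python
def should_index(blob_name: str, included: str, excluded: str) -> bool:
    """Approximation of indexedFileNameExtensions /
    excludedFileNameExtensions: exclusion wins first, then inclusion.
    An empty inclusion list means all remaining extensions qualify."""
    name = blob_name.lower()
    exc = [e for e in excluded.lower().split(",") if e]
    inc = [e for e in included.lower().split(",") if e]
    if any(name.endswith(e) for e in exc):
        return False
    return not inc or any(name.endswith(e) for e in inc)

assert should_index("report.pdf", ".pdf,.docx", ".png,.jpeg")
assert not should_index("photo.png", ".pdf,.docx", ".png,.jpeg")
assert not should_index("notes.txt", ".pdf,.docx", ".png,.jpeg")
```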
@@ -152,49 +159,9 @@ Indexer configuration specifies the inputs, parameters, and properties controlli

 1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.

-## Change and deletion detection
-
-After an initial search index is created, you might want subsequent indexer jobs to pick up only new and changed documents. Fortunately, content in Azure Storage is timestamped, which gives indexers sufficient information for determining what's new and changed automatically. For search content that originates from Azure File Storage, the indexer keeps track of the file's `LastModified` timestamp and reindexes only new and changed files.
-
-Although change detection is a given, deletion detection is not. If you want to detect deleted files, use a "soft delete" approach. If you delete files outright in a file share, the corresponding search documents are not removed from the search index.
-
-## Soft delete using custom metadata
-
-This method uses a file's metadata to determine whether a search document should be removed from the index. It requires two separate actions: deleting the search document from the index, followed by deleting the file in Azure Storage.
-
-There are steps to follow in both File storage and Cognitive Search, but there are no other feature dependencies.
-
-1. Add a custom metadata key-value pair to the file in Azure Storage to indicate to Azure Cognitive Search that it is logically deleted.
-
-1. Configure a soft-delete column detection policy on the data source. For example, the following policy considers a file to be deleted if it has a metadata property `IsDeleted` with the value `true`:
-
-    ```http
-    PUT https://[service name].search.windows.net/datasources/file-datasource?api-version=2020-06-30
-    Content-Type: application/json
-    api-key: [admin key]
-
-    {
-        "name" : "file-datasource",
-        "type" : "azurefile",
-        "credentials" : { "connectionString" : "<your storage connection string>" },
-        "container" : { "name" : "my-share", "query" : null },
-        "dataDeletionDetectionPolicy" : {
-            "@odata.type" : "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
-            "softDeleteColumnName" : "IsDeleted",
-            "softDeleteMarkerValue" : "true"
-        }
-    }
-    ```
-
-1. Once the indexer has processed the file and deleted the document from the search index, you can delete the file in Azure Storage.
-
-### Reindexing undeleted files (using custom metadata)
-
-After an indexer processes a deleted file and removes the corresponding search document from the index, it won't revisit that file if you restore it later, provided the file's `LastModified` timestamp is older than the last indexer run.
-
-If you want to reindex that document, set the file's soft-delete metadata value to `"false"` (for example, `IsDeleted` = `"false"`) and rerun the indexer.
+## Next steps

-## See also
+You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers.md). The following articles apply to indexers that pull content from Azure Storage:

-+ [Indexers in Azure Cognitive Search](search-indexer-overview.md)
-+ [What is Azure Files?](../storage/files/storage-files-introduction.md)
++ [Change detection and deletion detection](search-howto-index-changed-deleted-blobs.md)
++ [Index large data sets](search-howto-large-index.md)

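The `SoftDeleteColumnDeletionDetectionPolicy` removed in the diff above reduces to a simple metadata comparison. A minimal sketch of that decision rule follows; this is my approximation in Python, not the indexer's actual implementation:

```python
# Sketch of the soft-delete decision: a document is treated as deleted
# when the configured metadata column carries the configured marker
# value. Hypothetical helper, not the actual indexer code.

def is_soft_deleted(metadata: dict, policy: dict) -> bool:
    column = policy["softDeleteColumnName"]
    marker = policy["softDeleteMarkerValue"]
    return metadata.get(column) == marker

policy = {
    "softDeleteColumnName": "IsDeleted",
    "softDeleteMarkerValue": "true",
}

assert is_soft_deleted({"IsDeleted": "true"}, policy)       # remove from index
assert not is_soft_deleted({"IsDeleted": "false"}, policy)  # keep indexing
assert not is_soft_deleted({}, policy)                      # no marker: keep
```

Note that a missing marker simply means the file keeps getting indexed, which is why the marker must be set before the file is physically deleted.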
articles/search/search-howto-index-azure-data-lake-storage.md

Lines changed: 15 additions & 65 deletions
@@ -109,7 +109,7 @@ In a [search index](search-what-is-an-index.md), add fields to accept the conten
     {
       "name" : "my-search-index",
       "fields": [
-        { "name": "metadata_storage_path", "type": "Edm.String", "key": true, "searchable": false },
+        { "name": "ID", "type": "Edm.String", "key": true, "searchable": false },
         { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false },
         { "name": "metadata_storage_name", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true },
         { "name": "metadata_storage_size", "type": "Edm.Int64", "searchable": false, "filterable": true, "sortable": true },
@@ -119,10 +119,15 @@ In a [search index](search-what-is-an-index.md), add fields to accept the conten
     }
     ```

-1. Designate one string field as the document key that uniquely identifies each document. For blob content, the best candidates for a document key are metadata properties on the blob.
+1. Create a document key field ("key": true). For blob content, the best candidates are metadata properties. Metadata properties often include characters, such as `/` and `-`, that are invalid for document keys. Because the indexer has a "base64EncodeKeys" property (true by default), it automatically encodes the metadata property, with no configuration or field mapping required.

-1. Add a "content" field to store extracted text from each file.
-<!-- 1. A **`content`** field is common to blob content. It contains the text extracted from blobs. Your definition of this field might look similar to the one above. You aren't required to use this name, but doing so lets you take advantage of implicit field mappings. The blob indexer can send blob contents to a content Edm.String field in the index, with no field mappings required. -->
+   + **`metadata_storage_path`** (default): the full path to the object or file
+
+   + **`metadata_storage_name`**: usable only if names are unique
+
+   + A custom metadata property that you add to blobs. This option requires that your blob upload process adds that metadata property to all blobs. Because the key is a required property, any blobs that are missing a value fail to be indexed. If you use a custom metadata property as a key, avoid making changes to that property. Indexers add duplicate documents for the same blob if the key property changes.
+
+1. Add a "content" field to store the text extracted from each file through the blob's "content" property. You aren't required to use this name, but doing so lets you take advantage of implicit field mappings.

 1. Add fields for standard metadata properties. The indexer can read custom metadata properties, [standard metadata](#indexing-blob-metadata) properties, and [content-specific metadata](search-blob-metadata-properties.md) properties.

@@ -142,6 +147,7 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
   "batchSize": null,
   "maxFailedItems": null,
   "maxFailedItemsPerBatch": null,
+  "base64EncodeKeys": null,
   "configuration": {
     "indexedFileNameExtensions" : ".pdf,.docx",
     "excludedFileNameExtensions" : ".png,.jpeg"
@@ -153,62 +159,6 @@ Indexer configuration specifies the inputs, parameters, and properties controlli
     ```

 1. In the optional "configuration" section, provide any inclusion or exclusion criteria. If left unspecified, all blobs in the container are retrieved.
-1.
-    ```http
-    POST https://[service name].search.windows.net/indexers?api-version=2020-06-30
-    Content-Type: application/json
-    api-key: [admin key]
-
-    {
-        "name" : "adlsgen2-indexer",
-        "dataSourceName" : "adlsgen2-datasource",
-        "targetIndexName" : "my-target-index",
-        "schedule" : {
-            "interval" : "PT2H"
-        }
-    }
-    ```
-
-This indexer runs immediately, and then [on a schedule](search-howto-schedule-indexers.md) every two hours (the schedule interval is set to "PT2H"). To run an indexer every 30 minutes, set the interval to "PT30M". The shortest supported interval is 5 minutes. The schedule is optional; if omitted, an indexer runs only once when it's created. However, you can run an indexer on demand at any time.
-
-<a name="DocumentKeys"></a>
-
-## Defining document keys and field mappings
-
-In a search index, the document key uniquely identifies each document. The field you choose must be of type `Edm.String`. For blob content, the best candidates for a document key are metadata properties on the blob.
-
-+ **`metadata_storage_name`** - this property is a candidate, but only if names are unique across all containers and folders you are indexing. Regardless of blob location, the end result is that the document key (name) must be unique in the search index after all content has been indexed.
-
-  Another potential issue with the storage name is that it might contain characters that are invalid for document keys, such as dashes. You can handle invalid characters by using the `base64Encode` [field mapping function](search-indexer-field-mappings.md#base64EncodeFunction). If you do, remember to also encode document keys when passing them in API calls such as [Lookup Document (REST)](/rest/api/searchservice/lookup-document). In .NET, you can use the [UrlTokenEncode method](/dotnet/api/system.web.httpserverutility.urltokenencode) to encode characters.
-
-+ **`metadata_storage_path`** - using the full path ensures uniqueness, but the path definitely contains `/` characters that are [invalid in a document key](/rest/api/searchservice/naming-rules). As above, you can use the `base64Encode` [function](search-indexer-field-mappings.md#base64EncodeFunction) to encode characters.
-
-+ A third option is to add a custom metadata property to the blobs. This option requires that your blob upload process adds that metadata property to all blobs. Because the key is a required property, any blobs that are missing a value fail to be indexed.
-
-> [!IMPORTANT]
-> If there is no explicit mapping for the key field in the index, Azure Cognitive Search automatically uses `metadata_storage_path` as the key and base-64 encodes key values (the second option above).
->
-> If you use a custom metadata property as a key, avoid making changes to that property. Indexers add duplicate documents for the same blob if the key property changes.
-
-### Example
-
-The following example demonstrates `metadata_storage_name` as the document key. Assume the index has a key field named `key` and another field named `fileSize` for storing the document size. [Field mappings](search-indexer-field-mappings.md) in the indexer definition establish field associations, and `metadata_storage_name` has the [`base64Encode` field mapping function](search-indexer-field-mappings.md#base64EncodeFunction) to handle unsupported characters.
-
-```http
-PUT https://[service name].search.windows.net/indexers/adlsgen2-indexer?api-version=2020-06-30
-Content-Type: application/json
-api-key: [admin key]
-
-{
-    "dataSourceName" : "adlsgen2-datasource",
-    "targetIndexName" : "my-target-index",
-    "schedule" : { "interval" : "PT2H" },
-    "fieldMappings" : [
-        { "sourceFieldName" : "metadata_storage_name", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
-        { "sourceFieldName" : "metadata_storage_size", "targetFieldName" : "fileSize" }
-    ]
-}
-```

 ### How to make an encoded field "searchable"

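The removed section above points callers to .NET's `UrlTokenEncode` for encoding document keys before API calls such as Lookup Document. As a rough Python equivalent for experimentation (my approximation of that convention: base64 with `+`/`/` swapped for `-`/`_` and trailing `=` padding replaced by a digit giving the count removed; verify against the .NET output before relying on it):

```python
import base64

def url_token_encode(data: bytes) -> str:
    """Approximation of .NET HttpServerUtility.UrlTokenEncode."""
    b64 = base64.b64encode(data).decode("ascii")
    stripped = b64.rstrip("=")
    pad = len(b64) - len(stripped)  # number of '=' chars removed
    return stripped.replace("+", "-").replace("/", "_") + str(pad)

def url_token_decode(token: str) -> bytes:
    """Reverse of url_token_encode: restore padding, undo swaps."""
    pad = int(token[-1])
    b64 = token[:-1].replace("-", "+").replace("_", "/") + "=" * pad
    return base64.b64decode(b64)

token = url_token_encode(b"folder/sub-folder/file.pdf")
assert url_token_decode(token) == b"folder/sub-folder/file.pdf"
assert "/" not in token and "=" not in token
```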
@@ -393,10 +343,10 @@ You can also set [blob configuration properties](/rest/api/searchservice/create-

 + `"indexStorageMetadataOnlyForOversizedDocuments"` to index storage metadata for blob content that is too large to process. Oversized blobs are treated as errors by default. For limits on blob size, see [Service limits](search-limits-quotas-capacity.md).

-## See also
+## Next steps
+
+You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers.md). The following articles apply to indexers that pull content from Azure Storage:

++ [Change detection and deletion detection](search-howto-index-changed-deleted-blobs.md)
++ [Index large data sets](search-howto-large-index.md)
 + [C# Sample: Index Data Lake Gen2 using Azure AD](https://github.com/Azure-Samples/azure-search-dotnet-samples/blob/master/data-lake-gen2-acl-indexing/README.md)
-+ [Indexers in Azure Cognitive Search](search-indexer-overview.md)
-+ [Create an indexer](search-howto-create-indexers.md)
-+ [AI enrichment overview](cognitive-search-concept-intro.md)
-+ [Search over blobs overview](search-blob-storage-integration.md)
