description: Set up an Azure Files indexer to automate indexing of file shares in Azure Cognitive Search.
manager: nitinme
author: mattmsft
ms.author: magottei
ms.service: cognitive-search
ms.topic: how-to
-ms.date: 01/17/2022
+ms.date: 01/19/2022
---

# Index data from Azure Files

> [!IMPORTANT]
> Azure Files indexer is currently in public preview under [Supplemental Terms of Use](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). Use a [preview REST API (2020-06-30-preview or later)](search-api-preview.md) to create the indexer data source.

-In this article, learn the steps for extracting content and metadata from file shares in Azure Storage and sending the content to a search index in Azure Cognitive Search. The resulting index can be queried using full text search.
+Configure a [search indexer](search-indexer-overview.md) to extract content from Azure File Storage and make it searchable in Azure Cognitive Search.

This article supplements [**Create an indexer**](search-howto-create-indexers.md) with information specific to indexing files in Azure Storage.

@@ -25,7 +25,7 @@ This article supplements [**Create an indexer**](search-howto-create-indexers.md

+ An [SMB file share](../storage/files/files-smb-protocol.md) providing the source content. [NFS shares](../storage/files/files-nfs-protocol.md#support-for-azure-storage-features) are not supported.

-+ Files should contain non-binary textual content for text-based indexing. This indexer also supports [AI enrichment](cognitive-search-concept-intro.md) if you have binary files.
++ Files containing text. If you have binary data, you can include [AI enrichment](cognitive-search-concept-intro.md) for image analysis.

## Supported document formats

@@ -35,7 +35,7 @@ The Azure Files indexer can extract text from the following document formats:

## Define the data source

-A primary difference between a file share indexer and other indexers is the data source assignment. The data source definition specifies "type": `"azurefile"`, a content path, and how to connect.
+The data source definition specifies the data source type, content path, and how to connect.

1. [Create or update a data source](/rest/api/searchservice/preview-api/create-or-update-data-source) to set its definition, using a preview API version 2020-06-30-Preview or 2021-04-30-Preview for "type": `"azurefile"`.

@@ -54,44 +54,44 @@ A primary difference between a file share indexer and other indexers is the data

1. Set "container" to the root file share, and use "query" to specify any subfolders.

-A data source definition can also include additional properties for [soft deletion policies](#soft-delete-using-custom-metadata) and [field mappings](search-indexer-field-mappings.md) if field names and types are not the same.
+A data source definition can also include [soft deletion policies](search-howto-index-changed-deleted-blobs.md), if you want the indexer to delete a search document when the source document is flagged for deletion.

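The data source steps above can be sketched as a small Python helper that builds the request body and URL without sending anything. This is a minimal sketch, not the article's own sample: the "credentials" and "connectionString" property names are assumed to follow the Create Data Source REST API, and the data source name, share name, subfolder, and service name are hypothetical placeholders.

```python
import json

# Preview API version named in the article for "type": "azurefile".
API_VERSION = "2020-06-30-Preview"


def build_file_datasource(name, connection_string, share, subfolder=None):
    """Return a JSON-serializable body for an "azurefile" data source."""
    container = {"name": share}  # "container" is the root file share
    if subfolder:
        # "query" narrows indexing to a subfolder of the share.
        container["query"] = subfolder
    return {
        "name": name,
        "type": "azurefile",
        # Assumed property names (Create Data Source REST API).
        "credentials": {"connectionString": connection_string},
        "container": container,
    }


def datasource_url(service_name, name):
    """Build the URL for a create-or-update (PUT) request."""
    return (
        f"https://{service_name}.search.windows.net"
        f"/datasources/{name}?api-version={API_VERSION}"
    )


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    body = build_file_datasource(
        "my-file-datasource",
        "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>",
        "my-file-share",
        subfolder="reports",
    )
    print(datasource_url("my-service", body["name"]))
    print(json.dumps(body, indent=2))
```

Keeping the payload construction separate from the HTTP call makes it easy to review the definition before sending it with whichever client you prefer.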
<a name="Credentials"></a>

### Supported credentials and connection strings

Indexers can connect to a file share using the following connections.
|This connection string does not require an account key, but you must have previously configured a search service to [connect using a managed identity](search-howto-managed-identities-storage.md).|
-You can get the connection string from the Storage account page in Azure portal by selecting **Access keys** in the left navigation pane. Make sure to select a full connection string and not just a key.
| You can get the connection string from the Storage account page in Azure portal by selecting **Access keys** in the left navigation pane. Make sure to select a full connection string and not just a key. |
| The SAS should have the list and read permissions on containers and objects (blobs in this case). |
-This connection string requires [configuring your search service as a trusted service](search-howto-managed-identities-storage.md) under Azure Active Directory, and then granting **Reader and data access** rights to the search service in Azure Storage.
The SAS should have the list and read permissions on the file share. For more information on storage shared access signatures, see [Using Shared Access Signatures](../storage/common/storage-sas-overview.md).
| The SAS should have the list and read permissions on the container. For more information, see [Using Shared Access Signatures](../storage/common/storage-sas-overview.md). |

> [!NOTE]
-> If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer will fail with an error message similar to "Credentials provided in the connection string are invalid or have expired".
+> If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer will fail with an error message similar to "Credentials provided in the connection string are invalid or have expired".

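One way to act on the note above is to check a SAS token's expiry before the indexer runs. The sketch below is illustrative only: it assumes the token carries its expiry in the standard `se` (signed expiry) query field, and the function names and the seven-day renewal margin are hypothetical choices, not part of the service.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs


def sas_expiry(sas_token):
    """Return the expiry time from the `se` field of a SAS token, or None."""
    values = parse_qs(sas_token.lstrip("?")).get("se")
    if not values:
        return None  # no expiry in the token itself
    # `se` is an ISO 8601 UTC time, for example 2022-06-30T00:00:00Z.
    expiry = datetime.fromisoformat(values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        # Date-only values carry no offset; treat them as UTC.
        expiry = expiry.replace(tzinfo=timezone.utc)
    return expiry


def needs_renewal(sas_token, now=None, margin=timedelta(days=7)):
    """True if the token expires within `margin` (hypothetical default)."""
    expiry = sas_expiry(sas_token)
    if expiry is None:
        return False
    now = now or datetime.now(timezone.utc)
    return expiry - now <= margin


if __name__ == "__main__":
    token = "sv=2020-08-04&se=2022-06-30T00%3A00%3A00Z&sp=rl&sig=placeholder"
    print(sas_expiry(token), needs_renewal(token))
```

A check like this could run on the same schedule as the indexer, so that the data source credentials are updated before the indexer starts failing.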
## Add search fields to an index

In the [search index](search-what-is-an-index.md), add fields to accept the content and metadata of your Azure files.

-1. [Create or update an index](/rest/api/searchservice/create-index) to define search fields that will store file content, metadata, and system properties:
+1. [Create or update an index](/rest/api/searchservice/create-index) to define search fields that will store file content and metadata:

-```json
+```http
POST /indexes?api-version=2020-06-30
{
"name" : "my-search-index",
@@ -106,9 +106,15 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont
}
```

-1. Create a key field ("key": true) to uniquely identify each search document based on unique identifiers in the files. For this data source type, the indexer will automatically identify and encode a value for this field. No field mappings are necessary.
+1. Create a document key field ("key": true). For blob content, the best candidates are metadata properties. Metadata properties often include characters, such as `/` and `-`, that are invalid for document keys. Because the indexer has a "base64EncodeKeys" property (true by default), it automatically encodes the metadata property, with no configuration or field mapping required.
+
++ **`metadata_storage_path`** (default) full path to the object or file
+
++ **`metadata_storage_name`** usable only if names are unique

-1. Add a "content" field to store extracted text from each file.
++ A custom metadata property that you add to blobs. This option requires that your blob upload process adds that metadata property to all blobs. Since the key is a required property, any blobs that are missing a value will fail to be indexed. If you use a custom metadata property as a key, avoid making changes to that property. Indexers will add duplicate documents for the same blob if the key property changes.
+
+1. Add a "content" field to store extracted text from each file through the blob's "content" property. You aren't required to use this name, but doing so lets you take advantage of implicit field mappings.

1. Add fields for standard metadata properties. In file indexing, the standard metadata properties are the same as blob metadata properties. The file indexer automatically creates internal field mappings for these properties that convert hyphenated property names to underscored property names. You still have to add the fields you want to use to the index definition, but you can omit creating field mappings in the data source.

@@ -122,6 +128,8 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont

## Configure the file indexer

+Indexer configuration specifies the inputs, parameters, and properties controlling run time behaviors. Under "configuration", you can specify which files are indexed by file type or by properties on the files themselves.
+
1. [Create or update an indexer](/rest/api/searchservice/create-indexer) to use the predefined data source and search index.

```http
@@ -134,6 +142,7 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont
"batchSize": null,
"maxFailedItems": null,
"maxFailedItemsPerBatch": null,
+"base64EncodeKeys": null,
"configuration": {
"indexedFileNameExtensions" : ".pdf,.docx",
"excludedFileNameExtensions" : ".png,.jpeg"
@@ -148,51 +157,15 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont

If both `indexedFileNameExtensions` and `excludedFileNameExtensions` parameters are present, Azure Cognitive Search first looks at `indexedFileNameExtensions`, then at `excludedFileNameExtensions`. If the same file extension is present in both lists, it will be excluded from indexing.

-1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.
-
-## Change and deletion detection
-
-After an initial search index is created, you might want subsequent indexer jobs to pick up only new and changed documents. Fortunately, content in Azure Storage is timestamped, which gives indexers sufficient information for determining what's new and changed automatically. For search content that originates from Azure File Storage, the indexer keeps track of the file's `LastModified` timestamp and reindexes only new and changed files.
-
-Although change detection is a given, deletion detection is not. If you want to detect deleted files, make sure to use a "soft delete" approach. If you delete the files outright in a file share, corresponding search documents will not be removed from the search index.
-
-## Soft delete using custom metadata
+1. [Specify field mappings](search-indexer-field-mappings.md) if there are differences in field name or type, or if you need multiple versions of a source field in the search index.

-This method uses a file's metadata to determine whether a search document should be removed from the index. This method requires two separate actions, deleting the search document from the index, followed by file deletion in Azure Storage.
+In file indexing, you can often omit field mappings because the indexer has built-in support for mapping the "content" and metadata properties to similarly named and typed fields in an index. For metadata properties, the indexer will automatically replace hyphens `-` with underscores in the search index.

-There are steps to follow in both File storage and Cognitive Search, but there are no other feature dependencies.
-
-1. Add a custom metadata key-value pair to the file in Azure storage to indicate to Azure Cognitive Search that it is logically deleted.
-
-1. Configure a soft deletion column detection policy on the data source. For example, the following policy considers a file to be deleted if it has a metadata property `IsDeleted` with the value `true`:
-
-```http
-PUT https://[service name].search.windows.net/datasources/file-datasource?api-version=2020-06-30
After an indexer processes a deleted file and removes the corresponding search document from the index, it won't revisit that file if you restore it later if the file's `LastModified` timestamp is older than the last indexer run.
+1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.

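The precedence rule for `indexedFileNameExtensions` and `excludedFileNameExtensions` described in the steps above can be modeled in a few lines of Python. This is a sketch of the documented behavior, not the service's implementation; the helper names are hypothetical, and case-insensitive matching is an assumption.

```python
import os


def parse_extensions(value):
    """Split a setting such as ".pdf,.docx" into a set of extensions."""
    # Lowercasing is an assumption; the article does not state case rules.
    return {e.strip().lower() for e in (value or "").split(",") if e.strip()}


def is_indexed(file_name, indexed=None, excluded=None):
    """Model the rule: check the include list first, then the exclude list."""
    ext = os.path.splitext(file_name)[1].lower()
    included = parse_extensions(indexed)
    skipped = parse_extensions(excluded)
    if included and ext not in included:
        return False  # an include list exists and the extension is not on it
    # An extension present in both lists ends up excluded.
    return ext not in skipped


if __name__ == "__main__":
    for name in ("report.pdf", "photo.png", "notes.txt"):
        print(name, is_indexed(name, ".pdf,.docx", ".png,.jpeg"))
```

With the settings from the indexer example (`.pdf,.docx` indexed, `.png,.jpeg` excluded), only `report.pdf` passes the filter, because `notes.txt` is not on the include list.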
-If you would like to reindex that document, change the `"softDeleteMarkerValue" : "false"` for that file and rerun the indexer.
+## Next steps

-## See also
+You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers.md). The following articles apply to indexers that pull content from Azure Storage:

-+ [Indexers in Azure Cognitive Search](search-indexer-overview.md)
-+ [What is Azure Files?](../storage/files/storage-files-introduction.md)
++ [Change detection and deletion detection](search-howto-index-changed-deleted-blobs.md)
++ [Index large data sets](search-howto-large-index.md)