
Commit e0704d1

Merge pull request #185275 from HeidiSteen/heidist-fresh3
ADLS phase 2 edits
2 parents 3fbfed2 + 2a64a36 commit e0704d1

8 files changed

+437
-596
lines changed

articles/search/TOC.yml

Lines changed: 19 additions & 21 deletions
Original file line number | Diff line number | Diff line change
@@ -329,30 +329,28 @@
329329
href: search-data-sources-gallery.md
330330
- name: Azure Storage
331331
items:
332-
- name: Blob Storage
333-
items:
334-
- name: Search over blobs
335-
href: search-blob-storage-integration.md
336-
- name: Configure a blob indexer
337-
href: search-howto-indexing-azure-blob-storage.md
338-
- name: Index changed and deleted blobs
339-
href: search-howto-index-changed-deleted-blobs.md
340-
- name: Index one-to-many blobs
341-
href: search-howto-index-one-to-many-blobs.md
342-
- name: Index plain text blobs
343-
href: search-howto-index-plaintext-blobs.md
344-
- name: Index CSV blobs
345-
href: search-howto-index-csv-blobs.md
346-
- name: Index JSON blobs
347-
href: search-howto-index-json-blobs.md
348-
- name: Index encrypted blobs
349-
href: search-howto-index-encrypted-blobs.md
350-
- name: Data Lake Storage Gen2
332+
- name: ADLS Gen2
351333
href: search-howto-index-azure-data-lake-storage.md
352-
- name: File Storage
334+
- name: Blobs
335+
href: search-howto-indexing-azure-blob-storage.md
336+
- name: Files
353337
href: search-file-storage-integration.md
354-
- name: Table Storage
338+
- name: Tables
355339
href: search-howto-indexing-azure-tables.md
340+
- name: Index changed and deleted content
341+
href: search-howto-index-changed-deleted-blobs.md
342+
- name: Index one-to-many
343+
href: search-howto-index-one-to-many-blobs.md
344+
- name: Index plain text
345+
href: search-howto-index-plaintext-blobs.md
346+
- name: Index CSV
347+
href: search-howto-index-csv-blobs.md
348+
- name: Index JSON
349+
href: search-howto-index-json-blobs.md
350+
- name: Index encrypted blobs
351+
href: search-howto-index-encrypted-blobs.md
352+
- name: Search over blobs
353+
href: search-blob-storage-integration.md
356354
- name: Azure Cosmos DB
357355
items:
358356
- name: SQL and MongoDB APIs
Lines changed: 43 additions & 70 deletions
Original file line number | Diff line number | Diff line change
@@ -1,21 +1,21 @@
11
---
2-
title: Azure Files indexing (preview)
2+
title: Azure Files indexer (preview)
33
titleSuffix: Azure Cognitive Search
44
description: Set up an Azure Files indexer to automate indexing of file shares in Azure Cognitive Search.
55
manager: nitinme
66
author: mattmsft
77
ms.author: magottei
88
ms.service: cognitive-search
99
ms.topic: how-to
10-
ms.date: 01/17/2022
10+
ms.date: 01/19/2022
1111
---
1212

1313
# Index data from Azure Files
1414

1515
> [!IMPORTANT]
1616
> Azure Files indexer is currently in public preview under [Supplemental Terms of Use](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). Use a [preview REST API (2020-06-30-preview or later)](search-api-preview.md) to create the indexer data source.
1717
18-
In this article, learn the steps for extracting content and metadata from file shares in Azure Storage and sending the content to a search index in Azure Cognitive Search. The resulting index can be queried using full text search.
18+
Configure a [search indexer](search-indexer-overview.md) to extract content from Azure File Storage and make it searchable in Azure Cognitive Search.
1919

2020
This article supplements [**Create an indexer**](search-howto-create-indexers.md) with information specific to indexing files in Azure Storage.
2121

@@ -25,7 +25,7 @@ This article supplements [**Create an indexer**](search-howto-create-indexers.md
2525

2626
+ An [SMB file share](../storage/files/files-smb-protocol.md) providing the source content. [NFS shares](../storage/files/files-nfs-protocol.md#support-for-azure-storage-features) are not supported.
2727

28-
+ Files should contain non-binary textual content for text-based indexing. This indexer also supports [AI enrichment](cognitive-search-concept-intro.md) if you have binary files.
28+
+ Files containing text. If you have binary data, you can include [AI enrichment](cognitive-search-concept-intro.md) for image analysis.
2929

3030
## Supported document formats
3131

@@ -35,7 +35,7 @@ The Azure Files indexer can extract text from the following document formats:
3535

3636
## Define the data source
3737

38-
A primary difference between a file share indexer and other indexers is the data source assignment. The data source definition specifies "type": `"azurefile"`, a content path, and how to connect.
38+
The data source definition specifies the data source type, content path, and how to connect.
3939

4040
1. [Create or update a data source](/rest/api/searchservice/preview-api/create-or-update-data-source) to set its definition, using a preview API version 2020-06-30-Preview or 2021-04-30-Preview for "type": `"azurefile"`.
4141

@@ -54,44 +54,44 @@ A primary difference between a file share indexer and other indexers is the data
5454

5555
1. Set "container" to the root file share, and use "query" to specify any subfolders.
5656

57-
A data source definition can also include additional properties for [soft deletion policies](#soft-delete-using-custom-metadata) and [field mappings](search-indexer-field-mappings.md) if field names and types are not the same.
57+
A data source definition can also include [soft deletion policies](search-howto-index-changed-deleted-blobs.md), if you want the indexer to delete a search document when the source document is flagged for deletion.
5858

5959
<a name="Credentials"></a>
6060

6161
### Supported credentials and connection strings
6262

6363
Indexers can connect to a file share using the following connections.
6464

65-
**Full access storage account connection string**:
66-
`{ "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<your storage account>;AccountKey=<your account key>;" }`
65+
| Managed identity connection string |
66+
|------------------------------------|
67+
|`{ "connectionString" : "ResourceId=/subscriptions/<your subscription ID>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>/;" }`|
68+
|This connection string does not require an account key, but you must have previously configured a search service to [connect using a managed identity](search-howto-managed-identities-storage.md).|
6769

68-
You can get the connection string from the Storage account page in Azure portal by selecting **Access keys** in the left navigation pane. Make sure to select a full connection string and not just a key.
70+
| Full access storage account connection string |
71+
|-----------------------------------------------|
72+
|`{ "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<your storage account>;AccountKey=<your account key>;" }` |
73+
| You can get the connection string from the Storage account page in Azure portal by selecting **Access keys** in the left navigation pane. Make sure to select a full connection string and not just a key. |
6974

70-
**Managed identity connection string**:
71-
`{ "connectionString" : "ResourceId=/subscriptions/<your subscription ID>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>/;" }`
75+
| Storage account shared access signature (SAS) connection string |
76+
|-------------------------------------------------------------------|
77+
| `{ "connectionString" : "BlobEndpoint=https://<your account>.blob.core.windows.net/;SharedAccessSignature=?sv=2016-05-31&sig=<the signature>&spr=https&se=<the validity end time>&srt=co&ss=b&sp=rl;" }` |
78+
| The SAS should have the list and read permissions on containers and objects (blobs in this case). |
7279

73-
This connection string requires [configuring your search service as a trusted service](search-howto-managed-identities-storage.md) under Azure Active Directory,and then granting **Reader and data access** rights to the search service in Azure Storage.
74-
75-
**Storage account shared access signature** (SAS) connection string:
76-
`{ "connectionString" : "BlobEndpoint=https://<your account>.file.core.windows.net/;SharedAccessSignature=?sv=2016-05-31&sig=<the signature>&spr=https&se=<the validity end time>&sp=rl&sr=s;" }`
77-
78-
The SAS should have the list and read permissions on file shares.
79-
80-
**Container shared access signature**:
81-
`{ "connectionString" : "ContainerSharedAccessUri=https://<your storage account>.file.core.windows.net/<share name>?sv=2016-05-31&sr=s&sig=<the signature>&se=<the validity end time>&sp=rl;" }`
82-
83-
The SAS should have the list and read permissions on the file share. For more information on storage shared access signatures, see [Using Shared Access Signatures](../storage/common/storage-sas-overview.md).
80+
| Container shared access signature |
81+
|-----------------------------------|
82+
| `{ "connectionString" : "ContainerSharedAccessUri=https://<your storage account>.blob.core.windows.net/<container name>?sv=2016-05-31&sr=c&sig=<the signature>&se=<the validity end time>&sp=rl;" }` |
83+
| The SAS should have the list and read permissions on the container. For more information, see [Using Shared Access Signatures](../storage/common/storage-sas-overview.md). |
8484

8585
> [!NOTE]
86-
> If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer will fail with an error message similar to "Credentials provided in the connection string are invalid or have expired".
86+
> If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer will fail with an error message similar to "Credentials provided in the connection string are invalid or have expired".
8787

8888
## Add search fields to an index
8989

9090
In the [search index](search-what-is-an-index.md), add fields to accept the content and metadata of your Azure files.
9191

92-
1. [Create or update an index](/rest/api/searchservice/create-index) to define search fields that will store file content, metadata, and system properties:
92+
1. [Create or update an index](/rest/api/searchservice/create-index) to define search fields that will store file content and metadata:
9393

94-
```json
94+
```http
9595
POST /indexes?api-version=2020-06-30
9696
{
9797
"name" : "my-search-index",
@@ -106,9 +106,15 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont
106106
}
107107
```
108108

109-
1. Create a key field ("key": true) to uniquely identify each search document based on unique identifiers in the files. For this data source type, the indexer will automatically identify and encode a value for this field. No field mappings are necessary.
109+
1. Create a document key field ("key": true). For blob content, the best candidates are metadata properties. Metadata properties often include characters, such as `/` and `-`, that are invalid for document keys. Because the indexer has a "base64EncodeKeys" property (true by default), it automatically encodes the metadata property, with no configuration or field mapping required.
110+
111+
+ **`metadata_storage_path`** (default) full path to the object or file
112+
113+
+ **`metadata_storage_name`** usable only if names are unique
110114

111-
1. Add a "content" field to store extracted text from each file.
115+
+ A custom metadata property that you add to blobs. This option requires that your blob upload process adds that metadata property to all blobs. Since the key is a required property, any blobs that are missing a value will fail to be indexed. If you use a custom metadata property as a key, avoid making changes to that property. Indexers will add duplicate documents for the same blob if the key property changes.
116+
117+
1. Add a "content" field to store extracted text from each file through the blob's "content" property. You aren't required to use this name, but doing so lets you take advantage of implicit field mappings.
112118

113119
1. Add fields for standard metadata properties. In file indexing, the standard metadata properties are the same as blob metadata properties. The file indexer automatically creates internal field mappings for these properties that convert hyphenated property names to underscored property names. You still have to add the fields you want to use in the index definition, but you can omit creating field mappings in the data source.
114120

@@ -122,6 +128,8 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont
122128

123129
## Configure the file indexer
124130

131+
Indexer configuration specifies the inputs, parameters, and properties controlling run time behaviors. Under "configuration", you can specify which files are indexed by file type or by properties on the files themselves.
132+
125133
1. [Create or update an indexer](/rest/api/searchservice/create-indexer) to use the predefined data source and search index.
126134

127135
```http
@@ -134,6 +142,7 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont
134142
"batchSize": null,
135143
"maxFailedItems": null,
136144
"maxFailedItemsPerBatch": null,
145+
"base64EncodeKeys": null,
137146
"configuration:" {
138147
"indexedFileNameExtensions" : ".pdf,.docx",
139148
"excludedFileNameExtensions" : ".png,.jpeg"
@@ -148,51 +157,15 @@ In the [search index](search-what-is-an-index.md), add fields to accept the cont
148157

149158
If both `indexedFileNameExtensions` and `excludedFileNameExtensions` parameters are present, Azure Cognitive Search first looks at `indexedFileNameExtensions`, then at `excludedFileNameExtensions`. If the same file extension is present in both lists, it will be excluded from indexing.
150159

151-
1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.
152-
153-
## Change and deletion detection
154-
155-
After an initial search index is created, you might want subsequent indexer jobs to pick up only new and changed documents. Fortunately, content in Azure Storage is timestamped, which gives indexers sufficient information for determining what's new and changed automatically. For search content that originates from Azure File Storage, the indexer keeps track of the file's `LastModified` timestamp and reindexes only new and changed files.
156-
157-
Although change detection is a given, deletion detection is not. If you want to detect deleted files, make sure to use a "soft delete" approach. If you delete the files outright in a file share, corresponding search documents will not be removed from the search index.
158-
159-
## Soft delete using custom metadata
160+
1. [Specify field mappings](search-indexer-field-mappings.md) if there are differences in field name or type, or if you need multiple versions of a source field in the search index.
160161

161-
This method uses a file's metadata to determine whether a search document should be removed from the index. This method requires two separate actions, deleting the search document from the index, followed by file deletion in Azure Storage.
162+
In file indexing, you can often omit field mappings because the indexer has built-in support for mapping the "content" and metadata properties to similarly named and typed fields in an index. For metadata properties, the indexer will automatically replace hyphens `-` with underscores in the search index.
162163

163-
There are steps to follow in both File storage and Cognitive Search, but there are no other feature dependencies.
164-
165-
1. Add a custom metadata key-value pair to the file in Azure storage to indicate to Azure Cognitive Search that it is logically deleted.
166-
167-
1. Configure a soft deletion column detection policy on the data source. For example, the following policy considers a file to be deleted if it has a metadata property `IsDeleted` with the value `true`:
168-
169-
```http
170-
PUT https://[service name].search.windows.net/datasources/file-datasource?api-version=2020-06-30
171-
Content-Type: application/json
172-
api-key: [admin key]
173-
174-
{
175-
"name" : "file-datasource",
176-
"type" : "azurefile",
177-
"credentials" : { "connectionString" : "<your storage connection string>" },
178-
"container" : { "name" : "my-share", "query" : null },
179-
"dataDeletionDetectionPolicy" : {
180-
"@odata.type" :"#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
181-
"softDeleteColumnName" : "IsDeleted",
182-
"softDeleteMarkerValue" : "true"
183-
}
184-
}
185-
```
186-
187-
1. Once the indexer has processed the file and deleted the document from the search index, you can delete the file in Azure Storage.
188-
189-
### Reindexing undeleted files (using custom metadata)
190-
191-
After an indexer processes a deleted file and removes the corresponding search document from the index, it won't revisit that file if you restore it later if the file's `LastModified` timestamp is older than the last indexer run.
164+
1. See [Create an indexer](search-howto-create-indexers.md) for more information about other properties.
192165

193-
If you would like to reindex that document, change the `"softDeleteMarkerValue" : "false"` for that file and rerun the indexer.
166+
## Next steps
194167

195-
## See also
168+
You can now [run the indexer](search-howto-run-reset-indexers.md), [monitor status](search-howto-monitor-indexers.md), or [schedule indexer execution](search-howto-schedule-indexers.md). The following articles apply to indexers that pull content from Azure Storage:
196169

197-
+ [Indexers in Azure Cognitive Search](search-indexer-overview.md)
198-
+ [What is Azure Files?](../storage/files/storage-files-introduction.md)
170+
+ [Change detection and deletion detection](search-howto-index-changed-deleted-blobs.md)
171+
+ [Index large data sets](search-howto-large-index.md)
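
Reviewer sketch (not part of the commit): the article text in this diff says that when both `indexedFileNameExtensions` and `excludedFileNameExtensions` are set, the include list is consulted first, then the exclude list, and an extension that appears in both is excluded. The Python below is a minimal illustration of that documented rule only. The function names `parse_extensions` and `is_indexed` are hypothetical, and the case-insensitive comparison and the "empty include list means everything is eligible" behavior are assumptions, not confirmed service behavior.

```python
import os


def parse_extensions(value):
    """Split a comma-separated list such as ".pdf,.docx" into a set.

    Lowercasing is an assumption made for this sketch.
    """
    if not value:
        return set()
    return {ext.strip().lower() for ext in value.split(",") if ext.strip()}


def is_indexed(file_name, indexed=None, excluded=None):
    """Return True if the file would pass both extension filters.

    Hypothetical helper: the include list is checked first, then the
    exclude list, so an extension present in both lists is excluded.
    """
    ext = os.path.splitext(file_name)[1].lower()
    include = parse_extensions(indexed)
    exclude = parse_extensions(excluded)

    # Step 1: if an include list is given, the extension must be in it.
    if include and ext not in include:
        return False

    # Step 2: the exclude list is applied last, so it wins on conflicts.
    return ext not in exclude
```

With the values from the indexer definition in the diff (`".pdf,.docx"` and `".png,.jpeg"`), `is_indexed("report.pdf", ".pdf,.docx", ".png,.jpeg")` passes both filters, while a `.png` file fails the include check before the exclude list is even consulted.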
