Commit 4fc13ff

Native Blob Soft Delete (preview)
Author: Mark Heffernan
1 parent 49ff2fc commit 4fc13ff

1 file changed

+68
-21
lines changed

articles/search/search-howto-indexing-azure-blob-storage.md

Lines changed: 68 additions & 21 deletions
@@ -263,7 +263,7 @@ The configuration parameters described above apply to all blobs. Sometimes, you
263263

264264
By default, the blob indexer stops as soon as it encounters a blob with an unsupported content type (for example, an image). You can of course use the `excludedFileNameExtensions` parameter to skip certain content types. However, you may need to index blobs without knowing all the possible content types in advance. To continue indexing when an unsupported content type is encountered, set the `failOnUnsupportedContentType` configuration parameter to `false`:
265265

266-
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2019-05-06
266+
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2019-05-06
267267
Content-Type: application/json
268268
api-key: [admin key]
269269

@@ -278,26 +278,69 @@ For some blobs, Azure Cognitive Search is unable to determine the content type,
278278

279279
Azure Cognitive Search limits the size of blobs that are indexed. These limits are documented in [Service Limits in Azure Cognitive Search](https://docs.microsoft.com/azure/search/search-limits-quotas-capacity). Oversized blobs are treated as errors by default. However, you can still index the storage metadata of oversized blobs if you set the `indexStorageMetadataOnlyForOversizedDocuments` configuration parameter to `true`:
280280

281-
"parameters" : { "configuration" : { "indexStorageMetadataOnlyForOversizedDocuments" : true } }
281+
"parameters" : { "configuration" : { "indexStorageMetadataOnlyForOversizedDocuments" : true } }
282282

283283
You can also continue indexing if errors happen at any point of processing, either while parsing blobs or while adding documents to an index. To ignore a specific number of errors, set the `maxFailedItems` and `maxFailedItemsPerBatch` configuration parameters to the desired values. For example:
284284

285-
{
286-
... other parts of indexer definition
287-
"parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 10 }
288-
}
285+
{
286+
... other parts of indexer definition
287+
"parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 10 }
288+
}
289289

290290
## Incremental indexing and deletion detection
291+
291292
When you set up a blob indexer to run on a schedule, it reindexes only the changed blobs, as determined by the blob's `LastModified` timestamp.
292293

293294
> [!NOTE]
294295
> You don't have to specify a change detection policy – incremental indexing is enabled for you automatically.
295296
296-
To support deleting documents, use a "soft delete" approach. If you delete the blobs outright, corresponding documents will not be removed from the search index. Instead, use the following steps:
297+
To support deleting documents, use a "soft delete" approach. If you delete the blobs outright, corresponding documents will not be removed from the search index.
298+
299+
There are two ways to implement the soft delete approach. Both are described below.
300+
301+
### Native blob soft delete (preview)
302+
303+
> [!IMPORTANT]
304+
> Support for native blob soft delete is in preview. Preview functionality is provided without a service level agreement, and is not recommended for production workloads. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). The [REST API version 2019-05-06-Preview](https://docs.microsoft.com/azure/search/search-api-preview) provides this feature. There is currently no portal or .NET SDK support.
305+
306+
> [!NOTE]
307+
> When using the native blob soft delete policy, the document keys for the documents in your index must be either a blob property or blob metadata.
308+
309+
In this method you will use the [native blob soft delete](https://docs.microsoft.com/azure/storage/blobs/storage-blob-soft-delete) feature offered by Azure Blob storage. If native blob soft delete is enabled on your storage account, your data source has a native soft delete policy set, and the indexer finds a blob that has been transitioned to a soft deleted state, the indexer will remove that document from the index. The native blob soft delete policy is not supported when indexing blobs from Azure Data Lake Storage Gen2.
310+
311+
Use the following steps:
312+
1. Enable [native soft delete for Azure Blob storage](https://docs.microsoft.com/azure/storage/blobs/storage-blob-soft-delete). We recommend setting the retention policy to a value that's much higher than your indexer interval schedule. This way, if there's an issue running the indexer or if you have a large number of documents to index, there's plenty of time for the indexer to eventually process the soft deleted blobs. An Azure Cognitive Search indexer deletes a document from the index only if it processes the corresponding blob while the blob is in a soft deleted state.
313+
1. Configure a native blob soft deletion detection policy on the data source. An example is shown below. Since this feature is in preview, you must use the preview REST API.
314+
1. Run the indexer or set the indexer to run on a schedule. When the indexer runs and processes the soft deleted blob, the corresponding document is removed from the index.
315+
316+
```
317+
PUT https://[service name].search.windows.net/datasources/blob-datasource?api-version=2019-05-06-Preview
318+
Content-Type: application/json
319+
api-key: [admin key]
320+
{
321+
"name" : "blob-datasource",
322+
"type" : "azureblob",
323+
"credentials" : { "connectionString" : "<your storage connection string>" },
324+
"container" : { "name" : "my-container", "query" : null },
325+
"dataDeletionDetectionPolicy" : {
326+
"@odata.type" :"#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
327+
}
328+
}
329+
```
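The request above can also be assembled programmatically. The following is a minimal Python sketch, not an official client: the function name `build_native_soft_delete_request` and its parameters are hypothetical, and the function only builds the request object using the standard library. It does not send anything.

```python
import json
import urllib.request


def build_native_soft_delete_request(service_name, admin_key, datasource_name,
                                     connection_string, container_name):
    """Build (but do not send) the PUT request shown in the example above.

    The helper name and parameters are illustrative assumptions; the URL,
    headers, and body mirror the REST example in this article.
    """
    url = (
        f"https://{service_name}.search.windows.net/datasources/"
        f"{datasource_name}?api-version=2019-05-06-Preview"
    )
    body = {
        "name": datasource_name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container_name, "query": None},
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )
```

To send the request, you could pass the returned object to `urllib.request.urlopen`, substituting your own service name, admin key, and storage connection string.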
330+
331+
#### Reindexing undeleted blobs
297332
298-
1. Add a custom metadata property to the blob to indicate to Azure Cognitive Search that it is logically deleted
299-
2. Configure a soft deletion detection policy on the data source
300-
3. Once the indexer has processed the blob (as shown by the indexer status API), you can physically delete the blob
333+
If you delete a blob from Azure Blob storage with native soft delete enabled on your storage account, the blob transitions to a soft deleted state, giving you the option to undelete it within the retention period. When an Azure Cognitive Search data source has a native blob soft delete policy and the indexer processes a soft deleted blob, it removes that document from the index. If that blob is later undeleted, the indexer will not always reindex it. This is because the indexer determines which blobs to index based on each blob's `LastModified` timestamp. When a soft deleted blob is undeleted, its `LastModified` timestamp is not updated, so if the indexer has already processed blobs with more recent `LastModified` timestamps, it won't reindex the undeleted blob. To make sure that an undeleted blob is reindexed, update the blob's `LastModified` timestamp. One way to do this is to resave the blob's metadata. You don't need to change the metadata; resaving it updates the blob's `LastModified` timestamp so that the indexer knows it needs to reindex the blob.
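As a rough illustration of resaving metadata, the sketch below reads a blob's current metadata and writes it back unchanged. It assumes an object shaped like `BlobClient` from the `azure-storage-blob` Python package (with `get_blob_properties` and `set_blob_metadata`). The helper name `touch_blob_metadata` is hypothetical, and the blob must already have been undeleted.

```python
def touch_blob_metadata(blob_client):
    """Resave a blob's existing metadata without changing it.

    `blob_client` is assumed to behave like azure.storage.blob.BlobClient.
    Per the text above, resaving metadata is one way to refresh the blob's
    LastModified timestamp so the indexer picks the blob up again.
    """
    properties = blob_client.get_blob_properties()
    # Copy the metadata so the original properties object is left untouched.
    metadata = dict(properties.metadata or {})
    blob_client.set_blob_metadata(metadata=metadata)
    return metadata
```

After calling this helper for each undeleted blob, run the indexer again or wait for the next scheduled run.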
334+
335+
### Soft delete using custom metadata
336+
337+
In this method you will use a blob's metadata to indicate when a document should be removed from the search index.
338+
339+
Use the following steps:
340+
341+
1. Add a custom metadata key-value pair to the blob to indicate to Azure Cognitive Search that it is logically deleted.
342+
1. Configure a soft deletion column detection policy on the data source. An example is shown below.
343+
1. Once the indexer has processed the blob and deleted the document from the index, you can delete the blob from Azure Blob storage.
301344
302345
For example, the following policy considers a blob to be deleted if it has a metadata property `IsDeleted` with the value `true`:
303346
@@ -309,13 +352,17 @@ For example, the following policy considers a blob to be deleted if it has a met
309352
"name" : "blob-datasource",
310353
"type" : "azureblob",
311354
"credentials" : { "connectionString" : "<your storage connection string>" },
312-
"container" : { "name" : "my-container", "query" : "my-folder" },
355+
"container" : { "name" : "my-container", "query" : null },
313356
"dataDeletionDetectionPolicy" : {
314357
"@odata.type" :"#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
315358
"softDeleteColumnName" : "IsDeleted",
316359
"softDeleteMarkerValue" : "true"
317360
}
318-
}
361+
}
362+
363+
#### Reindexing undeleted blobs
364+
365+
If you set a soft delete column detection policy on your data source, add the custom metadata with the marker value to a blob, and then run the indexer, the indexer removes that document from the index. To reindex that document, change the soft delete metadata value for that blob so that it no longer matches the marker value, and rerun the indexer.
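The metadata change can be sketched as a small pure function. This is only an illustration: the helper name `clear_soft_delete_marker` is hypothetical, the defaults mirror the `IsDeleted` / `true` policy example above, and the replacement value `"false"` is an assumption (any value other than the marker should have the same effect). Saving the updated metadata back to the blob is left to your storage client.

```python
def clear_soft_delete_marker(metadata, column="IsDeleted", marker="true",
                             replacement="false"):
    """Return a copy of blob metadata with the soft delete marker cleared.

    Only the marker value is changed; all other metadata is preserved.
    The replacement value is an illustrative assumption.
    """
    updated = dict(metadata)
    if updated.get(column) == marker:
        updated[column] = replacement
    return updated
```

For example, `clear_soft_delete_marker({"IsDeleted": "true", "owner": "docs"})` yields metadata in which `IsDeleted` no longer matches the marker value, while `owner` is left as it was.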
319366
320367
## Indexing large datasets
321368
@@ -324,14 +371,14 @@ Indexing blobs can be a time-consuming process. In cases where you have millions
324371
- Partition your data into multiple blob containers or virtual folders
325372
- Set up several Azure Cognitive Search data sources, one per container or folder. To point to a blob folder, use the `query` parameter:
326373
327-
```
328-
{
329-
"name" : "blob-datasource",
330-
"type" : "azureblob",
331-
"credentials" : { "connectionString" : "<your storage connection string>" },
332-
"container" : { "name" : "my-container", "query" : "my-folder" }
333-
}
334-
```
374+
```
375+
{
376+
"name" : "blob-datasource",
377+
"type" : "azureblob",
378+
"credentials" : { "connectionString" : "<your storage connection string>" },
379+
"container" : { "name" : "my-container", "query" : "my-folder" }
380+
}
381+
```
335382
336383
- Create a corresponding indexer for each data source. All the indexers can point to the same target search index.
337384
@@ -359,7 +406,7 @@ If all your blobs contain plain text in the same encoding, you can significantly
359406
360407
By default, the `UTF-8` encoding is assumed. To specify a different encoding, use the `encoding` configuration property:
361408
362-
{
409+
{
363410
... other parts of indexer definition
364411
"parameters" : { "configuration" : { "parsingMode" : "text", "encoding" : "windows-1252" } }
365412
}
