Commit 416c09a

Document handling for Markdown re-indexing
1 parent 82aaafe commit 416c09a

1 file changed: +39 -0 lines changed

articles/search/search-how-to-index-markdown-blobs.md

Lines changed: 39 additions & 0 deletions
@@ -442,6 +442,45 @@ The resulting search document in the index would look as follows:

> [!NOTE]
> These examples specify how to use these parsing modes entirely with or without field mappings, but you can leverage both in one scenario if that suits your needs.

## Managing Stale Documents from Markdown Re-indexing

When you index Markdown files in Azure AI Search, it's important to understand how deletions are handled during re-indexing. The indexer doesn't automatically delete previously indexed documents when a file is modified or sections are removed, so stale or duplicate documents can remain in the index.

### Behavior overview

* **No automatic deletion**: If sections are removed from a Markdown file and the file is re-indexed, the indexer overwrites existing documents with new ones. However, it doesn't delete documents that no longer correspond to any content in the updated file.
* **Potential for duplicates**: This behavior can leave duplicate documents in the index, especially in `oneToMany` parsing mode, where each section becomes a separate document. The issue typically arises only when more Markdown sections are deleted than inserted between indexing runs. In that case, the index retains documents from the previous version that no longer match any current content.

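The following toy simulation may help make the behavior above concrete. It is not the indexer's implementation: it models the index as a dictionary and assumes, purely for illustration, that each section's document key is built from the file name and the section's position (the `index_file` helper and the `guide.md#<n>` key format are hypothetical). Under that assumption, an upsert-only re-index after removing sections leaves the highest-numbered documents behind.

```python
def index_file(index, path, sections):
    """Upsert one document per section and never delete anything."""
    for ordinal, text in enumerate(sections):
        # Hypothetical key scheme: file name plus section position.
        index[f"{path}#{ordinal}"] = text


index = {}

# First run: four sections produce four documents.
index_file(index, "guide.md", ["Intro", "Setup", "Usage", "FAQ"])

# Second run: two sections were removed from the file.
current_sections = ["Intro", "Usage"]
index_file(index, "guide.md", current_sections)

# Keys beyond the current section count were never overwritten or deleted.
stale_keys = sorted(
    key for key in index
    if int(key.rsplit("#", 1)[1]) >= len(current_sections)
)

for key in sorted(index):
    print(key, "->", index[key])
print("stale:", stale_keys)
```

In this sketch the old `Usage` and `FAQ` documents survive the second run, so a search would return `Usage` twice and a `FAQ` section that no longer exists in the file.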
### Workaround options

To ensure the index reflects the current state of your Markdown files, consider one of the following approaches:

#### Option 1. Use the delete API

Before re-indexing a modified Markdown file, explicitly delete the existing documents associated with that file by using the delete API (https://learn.microsoft.com/en-us/rest/api/searchservice/delete-documents).

Steps:

1. Identify the IDs of the documents associated with the file. Use a query like the following one to retrieve the document key (for example, `id` or `chunk_id`) for every document tied to a specific file. Replace `metadata_storage_path` with the field in your index that maps to the file path or blob URI. The field must be filterable.
```
POST https://<search-service>.search.windows.net/indexes/<index-name>/docs/search?api-version=2025-05-01-preview
Content-Type: application/json
api-key: [admin key]

{
    "filter": "metadata_storage_path eq 'https://<storage-account>.blob.core.windows.net/<container-name>/<file-name>.md'",
    "select": "id"
}
```
2. Issue a delete request for those documents.
3. Re-index the updated file.

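Steps 1 and 2 can be sketched in Python with only the standard library. This is a minimal sketch under several assumptions, not an official client: the helper names are hypothetical, the key field is assumed to be `id`, the path field is assumed to be `metadata_storage_path`, and all matching documents are assumed to fit in a single page of results (the `top` value of 1000 is an assumption about what one request can return). The delete itself is sent as a batch of `"@search.action": "delete"` entries to the index's `docs/index` endpoint.

```python
import json
import urllib.request

API_VERSION = "2025-05-01-preview"  # same version as the query above


def odata_quote(value: str) -> str:
    """Quote a string literal for an OData filter; single quotes are doubled."""
    return "'" + value.replace("'", "''") + "'"


def build_search_body(path_field: str, blob_url: str, key_field: str, top: int = 1000) -> dict:
    """Step 1: query body that lists the keys of documents tied to one file."""
    return {
        "filter": f"{path_field} eq {odata_quote(blob_url)}",
        "select": key_field,
        "top": top,  # assumption: one page is enough for a single file
    }


def build_delete_batch(keys: list[str], key_field: str) -> dict:
    """Step 2: batch body that deletes each document by its key."""
    return {"value": [{"@search.action": "delete", key_field: key} for key in keys]}


def post_json(url: str, api_key: str, body: dict) -> dict:
    """POST a JSON body with the admin key and return the parsed response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def delete_documents_for_file(service_url, index_name, api_key, blob_url,
                              path_field="metadata_storage_path", key_field="id"):
    """Find the documents for one file, delete them, and return their keys."""
    base = f"{service_url}/indexes/{index_name}/docs"
    found = post_json(f"{base}/search?api-version={API_VERSION}", api_key,
                      build_search_body(path_field, blob_url, key_field))
    keys = [doc[key_field] for doc in found["value"]]
    if keys:
        post_json(f"{base}/index?api-version={API_VERSION}", api_key,
                  build_delete_batch(keys, key_field))
    return keys
```

A call such as `delete_documents_for_file("https://<search-service>.search.windows.net", "<index-name>", "<admin key>", "<blob URL>")` would then run before step 3. Keeping the body builders separate from the network call makes it easy to inspect the exact payloads before sending anything to a live index.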
#### Option 2. Soft delete with metadata
If identifying stale documents in the index is difficult, you can use a soft-delete approach. This approach relies on a soft-delete detection policy being configured on the indexer's data source.

1. Mark the blob as deleted by setting a metadata field (for example, `deleted=true`).
2. Let the indexer run. It deletes all documents in the index associated with that blob.
3. Remove the soft-delete marker and re-index the file.
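For step 2 to have any effect, the data source needs to tell the indexer which metadata value marks a blob as deleted. The sketch below builds a blob data source definition with a soft-delete column policy as a Python dictionary. The data source name, the placeholder connection values, and the helper names are hypothetical, and the `deleted` / `true` pair simply mirrors the example marker in step 1; adjust them to whatever your blobs actually use.

```python
import json


def soft_delete_policy(metadata_name: str = "deleted", marker_value: str = "true") -> dict:
    """Deletion detection policy keyed on a blob metadata property."""
    return {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": metadata_name,   # blob metadata property name
        "softDeleteMarkerValue": marker_value,   # value that means "deleted"
    }


def blob_data_source(name: str, container: str, connection_string: str) -> dict:
    """Blob data source definition that includes the soft-delete policy."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
        "dataDeletionDetectionPolicy": soft_delete_policy(),
    }


# Placeholder values; substitute your own before sending this to the service.
definition = blob_data_source("markdown-blobs", "<container-name>", "<connection-string>")
print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of body you would send when creating or updating the data source; the indexer then treats any blob whose `deleted` metadata equals `true` as removed on its next run.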

## Next steps
