Commit 877bd1d

Merge pull request #6763 from mdonovanmsft/patch-1
Document handling for Markdown re-indexing
2 parents fa1679b + 681535c commit 877bd1d

File tree

1 file changed

+97
-17
lines changed

articles/search/search-how-to-index-markdown-blobs.md

Lines changed: 97 additions & 17 deletions
@@ -78,7 +78,7 @@ This setting can be changed after initial creation of the indexer, however the s
7878

7979
## Supported Markdown elements
8080

81-
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, and so forth, are treated as plain text and passed into a content field.
81+
Markdown parsing only splits content based on headers. All other elements such as lists, code blocks, tables, and so forth, are treated as plain text and passed into a content field.
8282

8383
<a name="parsing-markdown-one-to-many"></a>
8484

@@ -99,7 +99,7 @@ Content for section 2.
9999

100100
## Use one-to-many parsing mode
101101

102-
The one-to-many parsing mode parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into search documents which contain the following content:
102+
The one-to-many parsing mode parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into search documents, which contain the following content:
103103

104104
- `content`: A string that contains the raw Markdown found in a specific location, based on the header metadata at that point in the document.
105105

@@ -169,7 +169,7 @@ api-key: [admin key]
169169
```
170170

171171
> [!NOTE]
172-
> The `submode` does not need to be set explicitly here because `oneToMany` is the default.
172+
> The `submode` doesn't need to be set explicitly here because `oneToMany` is the default.
173173
174174
### Indexer output for one-to-many parsing
175175

@@ -249,7 +249,7 @@ In the one-to-one parsing mode, the entire Markdown document is indexed as a sin
249249

250250
Within the indexer definition, set the `parsingMode` to `"markdown"` and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
251251

252-
The Markdown is parsed based on headers into search documents which contain the following content:
252+
The Markdown is parsed based on headers into search documents, which contain the following content:
253253

254254
- `document_content`: Contains the full Markdown text as a single string. This field serves as a raw representation of the input document.
255255

@@ -259,13 +259,13 @@ The Markdown is parsed based on headers into search documents which contain the
259259

260260
- `header_name`: A string containing the text of the header as it appears in the Markdown document. This field provides a label or title for the section.
261261

262-
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there's no content directly under a header, this is an empty string.
262+
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there's no content directly under a header, the value is an empty string.
263263

264264
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each content block.
265265

266266
- `sections`: An array that contains objects representing subsections nested under the current section. This array follows the same structure as the top-level `sections` array, allowing for the representation of multiple levels of nested content. Each subsection object also includes `header_level`, `header_name`, `content`, and `ordinal_position` properties, enabling a recursive structure that represents any hierarchy of the Markdown content.
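
The recursive `sections` structure described above can be walked with a short depth-first traversal. The following Python is a minimal sketch, not part of the indexer: the `iter_sections` helper and the `sample_document` values are hypothetical, and only the field names come from the list above.

```python
from typing import Any, Dict, Iterator, List


def iter_sections(sections: List[Dict[str, Any]], depth: int = 0) -> Iterator[Dict[str, Any]]:
    """Yield every section depth-first, recursing into nested `sections` arrays."""
    for section in sections:
        yield {
            "depth": depth,
            "header_level": section.get("header_level"),
            "header_name": section.get("header_name"),
            "content": section.get("content", ""),
            "ordinal_position": section.get("ordinal_position"),
        }
        # Subsections share the same shape, so the same function handles them.
        yield from iter_sections(section.get("sections", []), depth + 1)


# Hypothetical search document shaped like the fields listed above.
sample_document = {
    "document_content": "# Section 1\nIntro\n## Subsection 1.1\nDetails\n",
    "sections": [
        {
            "header_level": "h1",
            "header_name": "Section 1",
            "content": "Intro",
            "ordinal_position": 1,
            "sections": [
                {
                    "header_level": "h2",
                    "header_name": "Subsection 1.1",
                    "content": "Details",
                    "ordinal_position": 2,
                    "sections": [],
                }
            ],
        }
    ],
}

# Restore the original reading order by sorting on ordinal_position.
flattened = sorted(iter_sections(sample_document["sections"]), key=lambda s: s["ordinal_position"])
```

Sorting on `ordinal_position` recovers the order in which sections appear in the source file, regardless of how deeply they're nested.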
267267

268-
Here's the sample Markdown that we're using to explain an index schema that's designed around each parsing mode.
268+
Here's the sample Markdown that we're using to explain the index schemas designed around each parsing mode.
269269

270270
```md
271271
# Section 1
@@ -286,50 +286,56 @@ If you aren't utilizing field mappings, the shape of the index should reflect th
286286
"name": "my-markdown-index",
287287
"fields": [
288288
{
289-
"name": "document_content",
289+
"name": "id",
290290
"type": "Edm.String",
291+
"key": true
292+
},
293+
{
294+
"name": "document_content",
295+
"type": "Edm.String"
296+
},
291297
{
292298
"name": "sections",
293-
"type": "Edm.ComplexType",
299+
"type": "Collection(Edm.ComplexType)",
294300
"fields": [
295301
{
296302
"name": "header_level",
297-
"type": "Edm.String",
303+
"type": "Edm.String"
298304
},
299305
{
300306
"name": "header_name",
301-
"type": "Edm.String",
307+
"type": "Edm.String"
302308
},
303309
{
304310
"name": "content",
305311
"type": "Edm.String"
306312
},
307313
{
308314
"name": "ordinal_position",
309-
"type": "Edm.Int"
315+
"type": "Edm.Int32"
310316
},
311317
{
312318
"name": "sections",
313-
"type": "Edm.ComplexType",
319+
"type": "Collection(Edm.ComplexType)",
314320
"fields": [
315321
{
316322
"name": "header_level",
317-
"type": "Edm.String",
323+
"type": "Edm.String"
318324
},
319325
{
320326
"name": "header_name",
321-
"type": "Edm.String",
327+
"type": "Edm.String"
322328
},
323329
{
324330
"name": "content",
325331
"type": "Edm.String"
326332
},
327333
{
328334
"name": "ordinal_position",
329-
"type": "Edm.Int"
335+
"type": "Edm.Int32"
330336
}]
331337
}]
332-
}
338+
}]
333339
}
334340
```
335341

@@ -441,7 +447,81 @@ The resulting search document in the index would look as follows:
441447
```
442448

443449
> [!NOTE]
444-
> These examples specify how to use these parsing modes entirely with or without field mappings, but you can leverage both in one scenario if that suits your needs.
450+
> These examples specify how to use these parsing modes entirely with or without field mappings, but you can apply both in one scenario if it suits your needs.
451+
>
452+
453+
## Managing stale documents from Markdown re-indexing
454+
455+
When using one-to-many parsing mode, re-indexing a modified Markdown file can result in stale or duplicate documents if sections are removed. This behavior is specific to one-to-many mode and doesn't apply to one-to-one parsing.
456+
457+
### Behavior overview
458+
459+
#### One-to-many parsing mode
460+
In `oneToMany` mode, each Markdown section (based on headers) is indexed as a separate search document. When the file is re-indexed:
461+
462+
* **No automatic deletion**: The indexer overwrites existing documents with new ones, but it does not delete documents that no longer correspond to any content in the updated file.
463+
* **Potential for duplicates**: This issue specifically arises only when more sections are deleted than inserted between indexing runs. In such cases, leftover documents from the previous version remain in the index, leading to stale entries that no longer reflect the current state of the source file.
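
To illustrate why leftover documents appear, the following Python sketch simulates two indexing runs. It assumes, purely for illustration, that each section's document key is derived from the file name and the section's position. The actual key generation used by the indexer isn't described here, and `section_keys` and `stale_keys` are hypothetical helpers.

```python
from typing import List


def section_keys(file_name: str, section_count: int) -> List[str]:
    # Hypothetical key scheme (an assumption): one key per section position.
    return [f"{file_name}#{i}" for i in range(1, section_count + 1)]


def stale_keys(previous: List[str], current: List[str]) -> List[str]:
    """Return keys written by the previous run that the current run doesn't overwrite."""
    current_set = set(current)
    return [key for key in previous if key not in current_set]


before = section_keys("guide.md", 5)  # first run: file has five sections
after = section_keys("guide.md", 3)   # second run: two sections were removed
leftover = stale_keys(before, after)  # documents that stay in the index
```

Under this assumption, the second run overwrites three documents and leaves two untouched, which matches the case where more sections are deleted than inserted.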
464+
465+
#### One-to-one parsing mode
466+
In `oneToOne` mode, the entire Markdown file is indexed as a single search document. When the file is re-indexed:
467+
* **Overwrite behavior**: The existing document is replaced entirely with the new version.
468+
* **No stale sections**: When the file is re-indexed, the existing document is replaced with the updated version and removed content is no longer included. The only exception is if the file path or blob URI changes, which could result in a new document being created alongside the old one.
469+
470+
### Workaround options
471+
472+
To ensure the index reflects the current state of your Markdown files, consider one of the following approaches:
473+
474+
#### Option 1. Soft delete with metadata
475+
This method uses a soft-delete to delete documents associated with a specific blob. For more information, see [Change and delete detection using indexers for Azure Storage in Azure AI Search](search-howto-index-changed-deleted-blobs.md#soft-delete-strategy-using-custom-metadata).
476+
477+
Steps:
478+
479+
1. Mark the blob as deleted by setting a metadata field.
480+
2. Let the indexer run. It deletes all documents in the index associated with that blob.
481+
3. Remove the soft-delete marker and re-index the file.
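
For the steps above to work, the data source needs a soft-delete detection policy that tells the indexer which blob metadata property marks a deletion. The following Python is a minimal sketch that builds that policy fragment as a dictionary. The metadata name `IsDeleted` and marker value `true` are assumptions chosen for illustration. Use whatever name and value you set on your blobs.

```python
import json
from typing import Dict


def soft_delete_policy(column_name: str, marker_value: str) -> Dict[str, str]:
    """Build the dataDeletionDetectionPolicy fragment for a blob data source."""
    return {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": column_name,
        "softDeleteMarkerValue": marker_value,
    }


# Assumed metadata name and value; they must match the metadata set on the blob.
policy = soft_delete_policy("IsDeleted", "true")
data_source_fragment = {"dataDeletionDetectionPolicy": policy}
policy_json = json.dumps(data_source_fragment, indent=2)
```

The resulting JSON fragment belongs in the data source definition, alongside the connection and container settings.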
482+
483+
#### Option 2. Use the delete API
484+
485+
Before re-indexing a modified Markdown file, explicitly delete the existing documents associated with that file using the [delete API](/rest/api/searchservice/documents#indexactiontype). You can either:
486+
487+
* Manually identify individual stale documents by finding the duplicates in the index to delete. This approach may be feasible for small, well-understood changes but can be time-consuming.
488+
* (**Recommended**) Remove all documents generated from the same parent file before re-indexing, ensuring inconsistencies are avoided.
489+
490+
Steps:
491+
492+
1. Identify the IDs of the documents associated with the file. Use a query like the following example to retrieve the document keys (for example, `id` or `chunk_id`) for all documents tied to a specific file. Replace `metadata_storage_path` with the field in your index that maps to the file path or blob URI. This field must be a key.
493+
```http
494+
POST https://[service name].search.windows.net/indexes/[index name]/docs/search?api-version=2025-05-01-preview
495+
Content-Type: application/json
496+
api-key: [admin key]
497+
498+
499+
{
500+
"filter": "metadata_storage_path eq 'https://<storage-account>.blob.core.windows.net/<container-name>/<file-name>.md'",
501+
"select": "id"
502+
}
503+
```
504+
505+
2. Issue a delete request for the documents with the identified keys.
506+
```http
507+
POST https://[service name].search.windows.net/indexes/[index name]/docs/index?api-version=2025-05-01-preview
508+
Content-Type: application/json
509+
api-key: [admin key]
510+
511+
{
512+
"value": [
513+
{
514+
"@search.action": "delete",
515+
"id": "aHR0c...jI1"
516+
},
517+
{
518+
"@search.action": "delete",
519+
"id": "aHR0...MQ2"
520+
}
521+
]
522+
}
523+
```
524+
3. Re-index the updated file.
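
The first two steps can be connected with a small amount of glue code. The following Python is a minimal sketch that turns a search response into the delete payload shown in step 2. It makes no network calls. The helper names are hypothetical, the key field is assumed to be `id`, and the sample keys are placeholders.

```python
import json
from typing import Any, Dict, Iterable, List


def keys_from_search_response(response: Dict[str, Any], key_field: str = "id") -> List[str]:
    """Collect document keys from the `value` array of a search response."""
    return [doc[key_field] for doc in response.get("value", []) if key_field in doc]


def build_delete_batch(keys: Iterable[str], key_field: str = "id") -> Dict[str, List[Dict[str, str]]]:
    """Build the request body for the docs/index endpoint, one delete action per key."""
    return {"value": [{"@search.action": "delete", key_field: key} for key in keys]}


# Hypothetical response from the filter query in step 1 (placeholder keys).
search_response = {
    "value": [
        {"@search.score": 1.0, "id": "key-one"},
        {"@search.score": 1.0, "id": "key-two"},
    ]
}

keys = keys_from_search_response(search_response)
delete_body = json.dumps(build_delete_batch(keys), indent=2)
```

Send `delete_body` as the body of the request in step 2. If your index uses a different key field, such as `chunk_id`, pass it as `key_field` to both helpers.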
445525
446526
## Next steps
447527
