articles/search/search-how-to-index-markdown-blobs.md
Lines changed: 97 additions & 17 deletions
@@ -78,7 +78,7 @@ This setting can be changed after initial creation of the indexer, however the s
78
78
79
79
## Supported Markdown elements
80
80
81
-
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, and so forth, are treated as plain text and passed into a content field.
81
+
Markdown parsing only splits content based on headers. All other elements such as lists, code blocks, tables, and so forth, are treated as plain text and passed into a content field.
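The header-only splitting described above can be illustrated with a minimal Python sketch. This is not the indexer's implementation: the `split_on_headers` function, the regular expression, and the sample text are all assumptions made for illustration. The sketch also ignores edge cases such as `#` lines inside fenced code blocks and text that appears before the first header.

```python
import re

# Matches ATX-style headers such as "# Title" or "### Title" (levels h1 to h6).
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_on_headers(markdown: str) -> list[dict]:
    """Split Markdown at header lines; every other line is kept as plain text."""
    sections = []
    current = None
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            # A header starts a new section.
            current = {
                "header_level": f"h{len(match.group(1))}",
                "header_name": match.group(2).strip(),
                "content": [],
            }
            sections.append(current)
        elif current is not None:
            # Lists, tables, and so on are not parsed, only collected as text.
            current["content"].append(line)
    for section in sections:
        section["content"] = "\n".join(section["content"]).strip()
    return sections


sample = "# Section 1\nIntro text.\n\n- a list item\n\n## Subsection 1.1\n| a | b |\n"
result = split_on_headers(sample)
```

In this sketch the list item and the table row end up verbatim inside the `content` string of their section, which mirrors the plain-text treatment of non-header elements.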
82
82
83
83
<a name="parsing-markdown-one-to-many"></a>
84
84
@@ -99,7 +99,7 @@ Content for section 2.
99
99
100
100
## Use one-to-many parsing mode
101
101
102
-
The one-to-many parsing mode parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into search documents which contain the following content:
102
+
The one-to-many parsing mode parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into search documents, which contain the following content:
103
103
104
104
- `content`: A string that contains the raw Markdown found in a specific location, based on the header metadata at that point in the document.
105
105
@@ -169,7 +169,7 @@ api-key: [admin key]
169
169
```
170
170
171
171
> [!NOTE]
172
-
> The `submode` does not need to be set explicitly here because `oneToMany` is the default.
172
+
> The `submode` doesn't need to be set explicitly here because `oneToMany` is the default.
173
173
174
174
### Indexer output for one-to-many parsing
175
175
@@ -249,7 +249,7 @@ In the one-to-one parsing mode, the entire Markdown document is indexed as a sin
249
249
250
250
Within the indexer definition, set the `parsingMode` to `"markdown"` and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
251
251
252
-
The Markdown is parsed based on headers into search documents which contain the following content:
252
+
The Markdown is parsed based on headers into search documents, which contain the following content:
253
253
254
254
- `document_content`: Contains the full Markdown text as a single string. This field serves as a raw representation of the input document.
255
255
@@ -259,13 +259,13 @@ The Markdown is parsed based on headers into search documents which contain the
259
259
260
260
- `header_name`: A string containing the text of the header as it appears in the Markdown document. This field provides a label or title for the section.
261
261
262
-
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there's no content directly under a header, this is an empty string.
262
+
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there's no content directly under a header, the value is an empty string.
263
263
264
264
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each content block.
265
265
266
266
- `sections`: An array that contains objects representing subsections nested under the current section. This array follows the same structure as the top-level `sections` array, allowing for the representation of multiple levels of nested content. Each subsection object also includes `header_level`, `header_name`, `content`, and `ordinal_position` properties, enabling a recursive structure that represents the hierarchy of the Markdown content.
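Because `sections` nests recursively, client code that reads a one-to-one search document typically walks it with a recursive function. The following Python sketch shows one way to do that. The `walk_sections` helper and the sample document, including its header names and ordinal values, are hypothetical and only illustrate the field layout described above.

```python
def walk_sections(sections, depth=1):
    """Yield (depth, header_name, ordinal_position) for every nested section."""
    for section in sections:
        yield depth, section["header_name"], section["ordinal_position"]
        # Subsections use the same shape, so the same function handles them.
        yield from walk_sections(section.get("sections", []), depth + 1)


# Hypothetical search document shaped like the fields listed above.
document = {
    "document_content": "full Markdown text goes here",
    "sections": [
        {
            "header_level": "h1",
            "header_name": "Section 1",
            "content": "Content for section 1.",
            "ordinal_position": 1,
            "sections": [
                {
                    "header_level": "h2",
                    "header_name": "Subsection 1.1",
                    "content": "Content for subsection 1.1.",
                    "ordinal_position": 2,
                    "sections": [],
                }
            ],
        },
        {
            "header_level": "h1",
            "header_name": "Section 2",
            "content": "Content for section 2.",
            "ordinal_position": 3,
            "sections": [],
        },
    ],
}

flat = list(walk_sections(document["sections"]))
```

Sorting the flattened tuples by `ordinal_position` would restore the original reading order of the sections.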
267
267
268
-
Here's the sample Markdown that we're using to explain an index schema that's designed around each parsing mode.
268
+
Here's the sample Markdown that we're using to explain the index schemas designed around each parsing mode.
269
269
270
270
```md
271
271
# Section 1
@@ -286,50 +286,56 @@ If you aren't utilizing field mappings, the shape of the index should reflect th
286
286
"name": "my-markdown-index",
287
287
"fields": [
288
288
{
289
-
"name": "document_content",
289
+
"name": "id",
290
290
"type": "Edm.String",
291
+
"key": true
292
+
},
293
+
{
294
+
"name": "document_content",
295
+
"type": "Edm.String"
296
+
},
291
297
{
292
298
"name": "sections",
293
-
"type": "Edm.ComplexType",
299
+
"type": "Collection(Edm.ComplexType)",
294
300
"fields": [
295
301
{
296
302
"name": "header_level",
297
-
"type": "Edm.String",
303
+
"type": "Edm.String"
298
304
},
299
305
{
300
306
"name": "header_name",
301
-
"type": "Edm.String",
307
+
"type": "Edm.String"
302
308
},
303
309
{
304
310
"name": "content",
305
311
"type": "Edm.String"
306
312
},
307
313
{
308
314
"name": "ordinal_position",
309
-
"type": "Edm.Int"
315
+
"type": "Edm.Int32"
310
316
},
311
317
{
312
318
"name": "sections",
313
-
"type": "Edm.ComplexType",
319
+
"type": "Collection(Edm.ComplexType)",
314
320
"fields": [
315
321
{
316
322
"name": "header_level",
317
-
"type": "Edm.String",
323
+
"type": "Edm.String"
318
324
},
319
325
{
320
326
"name": "header_name",
321
-
"type": "Edm.String",
327
+
"type": "Edm.String"
322
328
},
323
329
{
324
330
"name": "content",
325
331
"type": "Edm.String"
326
332
},
327
333
{
328
334
"name": "ordinal_position",
329
-
"type": "Edm.Int"
335
+
"type": "Edm.Int32"
330
336
}]
331
337
}]
332
-
}
338
+
}]
333
339
}
334
340
```
335
341
@@ -441,7 +447,81 @@ The resulting search document in the index would look as follows:
441
447
```
442
448
443
449
> [!NOTE]
444
-
> These examples specify how to use these parsing modes entirely with or without field mappings, but you can leverage both in one scenario if that suits your needs.
450
+
> These examples specify how to use these parsing modes entirely with or without field mappings, but you can apply both in one scenario if it suits your needs.
451
+
>
452
+
453
+
## Managing stale documents from Markdown re-indexing
454
+
455
+
When using one-to-many parsing mode, re-indexing a modified Markdown file can result in stale or duplicate documents if sections are removed. This behavior is specific to one-to-many mode and doesn't apply to one-to-one parsing.
456
+
457
+
### Behavior overview
458
+
459
+
#### One-to-many parsing mode
460
+
In `oneToMany` mode, each Markdown section (based on headers) is indexed as a separate search document. When the file is re-indexed:
461
+
462
+
* **No automatic deletion**: The indexer overwrites existing documents with new ones, but it doesn't delete documents that no longer correspond to any content in the updated file.
463
+
* **Potential for duplicates**: This issue arises only when more sections are deleted than inserted between indexing runs. In such cases, leftover documents from the previous version remain in the index as stale entries that no longer reflect the current state of the source file.
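The effect can be reproduced with a small in-memory simulation. This Python sketch assumes a hypothetical key scheme in which each search document key combines the parent file name with the section's position. The actual key generation used by the indexer isn't described here, so treat the `index_run` function and the `parent#position` keys purely as an illustration of upsert-without-delete behavior.

```python
def index_run(index: dict, parent: str, sections: list[str]) -> None:
    """Upsert one search document per section; nothing is ever deleted."""
    for position, content in enumerate(sections, start=1):
        index[f"{parent}#{position}"] = content  # hypothetical key scheme


index = {}
index_run(index, "notes.md", ["Intro", "Setup", "Usage"])  # first run: three sections
index_run(index, "notes.md", ["Intro", "Usage"])           # re-run after "Setup" is removed

# Keys beyond the current section count were not overwritten, so they are stale.
stale_keys = sorted(key for key in index if int(key.split("#")[1]) > 2)
```

After the second run, the document at position 3 still holds the old "Usage" text, so the index contains that section twice. If a new section had been added in the same edit, position 3 would have been overwritten and no stale document would remain.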
464
+
465
+
#### One-to-one parsing mode
466
+
In `oneToOne` mode, the entire Markdown file is indexed as a single search document. When the file is re-indexed:
467
+
* **Overwrite behavior**: The existing document is replaced entirely with the new version.
468
+
* **No stale sections**: Because the whole document is replaced, removed content doesn't linger in the index. The only exception is if the file path or blob URI changes, which could result in a new document being created alongside the old one.
469
+
470
+
### Workaround options
471
+
472
+
To ensure the index reflects the current state of your Markdown files, consider one of the following approaches:
473
+
474
+
#### Option 1. Soft delete with metadata
475
+
This method uses soft delete to remove all documents associated with a specific blob. For more information, see [Change and delete detection using indexers for Azure Storage in Azure AI Search](search-howto-index-changed-deleted-blobs.md#soft-delete-strategy-using-custom-metadata).
476
+
477
+
Steps:
478
+
479
+
1. Mark the blob as deleted by setting a metadata field.
480
+
2. Let the indexer run. It deletes all documents in the index associated with that blob.
481
+
3. Remove the soft-delete marker and re-index the file.
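For the steps above to work, the data source must already define a soft-delete detection policy that matches the metadata you set on the blob. The following Python sketch builds that policy fragment and the matching blob metadata as plain dictionaries. The metadata name `IsDeleted` and marker value `true` are example choices, and the `soft_delete_policy` helper is hypothetical; verify the property names against the linked article before relying on them.

```python
import json


def soft_delete_policy(column: str, marker: str) -> dict:
    """Build a soft-delete detection policy for a blob data source definition."""
    return {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": column,    # blob metadata property to inspect
        "softDeleteMarkerValue": marker,   # value that marks the blob as deleted
    }


# Fragment to merge into the data source definition (example names, not required ones).
data_source_fragment = {
    "dataDeletionDetectionPolicy": soft_delete_policy("IsDeleted", "true")
}

# Step 1: metadata to set on the blob so the next indexer run removes its documents.
blob_metadata = {"IsDeleted": "true"}

print(json.dumps(data_source_fragment, indent=2))
```

The column name and marker value in the policy must match the blob metadata exactly, otherwise the indexer won't treat the blob as deleted.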
482
+
483
+
#### Option 2. Use the delete API
484
+
485
+
Before re-indexing a modified Markdown file, explicitly delete the existing documents associated with that file using the [delete API](/rest/api/searchservice/documents#indexactiontype). You can either:
486
+
487
+
* Manually identify individual stale documents by finding the duplicates in the index. This approach might be feasible for small, well-understood changes but can be time-consuming.
488
+
* (**Recommended**) Remove all documents generated from the same parent file before re-indexing, which avoids inconsistencies.
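As a rough illustration of the recommended approach, the following Python sketch builds the request bodies for a batch delete once you know the document keys. It only constructs JSON payloads and sends nothing. The `build_delete_batches` helper, the `id` key field, and the sample keys are hypothetical, and the default batch size of 1,000 actions is an assumption about the per-request limit that you should confirm in the REST reference. Each body would be posted to the index's `docs/index` endpoint.

```python
import json


def build_delete_batches(key_field: str, keys: list[str], batch_size: int = 1000):
    """Yield index-batch payloads that delete the given document keys."""
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        yield {
            "value": [
                # A delete action only needs the action type and the document key.
                {"@search.action": "delete", key_field: key}
                for key in chunk
            ]
        }


# Hypothetical keys for all documents generated from one Markdown file.
keys_for_file = ["a", "b", "c"]
batches = list(build_delete_batches("id", keys_for_file, batch_size=2))

for batch in batches:
    print(json.dumps(batch))
```

Deleting every document for the parent file, rather than picking out stale ones, keeps the logic simple because the next indexer run recreates the documents that still apply.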
489
+
490
+
Steps:
491
+
492
+
1. Identify the IDs of the documents associated with the file. Use a query like the following example to retrieve the document keys (for example, `id` or `chunk_id`) for all documents tied to a specific file. Replace `metadata_storage_path` with the appropriate field in your index that maps to the file path or blob URI. This field must be a key.
493
+
```http
494
+
GET https://[service name].search.windows.net/indexes/[index name]/docs?api-version=2025-05-01-preview