In Azure AI Search, indexers for Azure Blob Storage, Azure Files, and OneLake support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
21
21
22
-
In Azure AI Search, indexers for Azure Blob Storage and Azure Files support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
+ One-to-one parsing mode, creating one search document per Markdown file
23
24
24
-
+ One-to-many parsing mode
25
-
+ One-to-one parsing mode
25
+
## Prerequisites
26
+
27
+
+ A supported data source: Azure Blob storage, Azure File storage, or OneLake in Microsoft Fabric.
28
+
29
+
For OneLake, make sure you meet all of the requirements of the [OneLake indexer](search-how-to-index-onelake-files.md#prerequisites).
30
+
31
+
Azure Storage for [blob indexers](search-howto-indexing-azure-blob-storage.md#prerequisites) and [file indexers](search-file-storage-integration.md#prerequisites) is a standard performance (general-purpose v2) instance that supports hot, cool, and cold access tiers.
32
+
33
+
## Markdown parsing mode parameters
34
+
35
+
Parsing mode parameters are specified in an indexer definition when you create or update an indexer.
36
+
37
+
```http
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
Content-Type: application/json
api-key: [admin key]

{
  "name": "my-markdown-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-target-index",
  "parameters": {
    "configuration": {
      "parsingMode": "markdown",
      "markdownParsingSubmode": "oneToMany",
      "markdownHeaderDepth": "h6"
    }
  }
}
```
26
55
27
56
The blob indexer provides a `submode` parameter to determine the structure of the output search documents. Markdown parsing mode provides the following submode options:
|**`markdown`**|**`oneToMany`**| Multiple per blob | (default) Breaks the Markdown into multiple search documents, each representing a content (nonheader) section of the Markdown file. |
60
+
|**`markdown`**|**`oneToMany`**| Multiple per blob | (default) Breaks the Markdown into multiple search documents, each representing a content (nonheader) section of the Markdown file. You can omit submode unless you want one-to-one parsing.|
32
61
|**`markdown`**|**`oneToOne`**| One per blob | Parses the Markdown into one search document, with sections mapped to specific headers in the Markdown file.|
33
62
34
63
For **`oneToMany`** submode, you should review [Indexing one blob to produce many search documents](search-howto-index-one-to-many-blobs.md) to understand how the blob indexer handles disambiguation of the document key for multiple search documents produced from the same blob.
35
64
36
65
Later sections describe each submode in more detail. If you're unfamiliar with indexer clients and concepts, see [Create a search indexer](search-howto-create-indexers.md). You should also be familiar with the details of [basic blob indexer configuration](search-howto-indexing-azure-blob-storage.md), which isn't repeated here.
37
66
38
-
## Additional Markdown parsing parameters
67
+
### Optional Markdown parsing parameters
39
68
40
69
Parameters are case-sensitive.
41
70
@@ -46,13 +75,14 @@ Parameters are case-sensitive.
46
75
This setting can be changed after initial creation of the indexer; however, the structure of the resulting search documents might change depending on the Markdown content.
47
76
48
77
## Supported Markdown elements
49
-
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, etc., are treated as plaintext.
78
+
79
+
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, and so forth, are treated as plain text and passed into a content field.
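As a rough illustration of header-only splitting, the following Python sketch breaks a Markdown string at ATX headers and leaves lists, tables, and fenced code untouched in the section body. It is not the service's parser: the function name `split_on_headers`, the regular expression, and the fence handling are all assumptions made for this example.

```python
import re

# Assumption: only ATX-style headers ("#" through "######") start a new section.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block


def split_on_headers(markdown: str) -> list[tuple[str, str]]:
    """Return (header_text, body) pairs; the body is kept as plain text."""
    sections: list[tuple[str, str]] = []
    header, body, in_fence = None, [], False

    def close_section() -> None:
        text = "\n".join(body).strip()
        if header is not None or text:
            sections.append((header or "", text))

    for line in markdown.splitlines():
        # Track fenced code so a "# comment" inside code isn't taken for a header.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            close_section()
            header, body = match.group(2), []
        else:
            body.append(line)
    close_section()
    return sections


# Hypothetical input: a list, a fenced code block, and a table row.
sample = "\n".join([
    "# Setup",
    "- step one",
    "- step two",
    "",
    FENCE + "sh",
    "# not a header",
    FENCE,
    "# Usage",
    "| a | b |",
])
sections = split_on_headers(sample)
```

In this sketch the list and the code block both end up inside the body of the `Setup` section, and the table row is the whole body of `Usage`.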
50
80
51
81
<a name="parsing-markdown-one-to-many"></a>
52
82
53
83
## Sample Markdown content
54
84
55
-
The following Markdown content will be used for the examples on this page:
85
+
The following Markdown content is used for the examples on this page:
56
86
57
87
```md
58
88
# Section 1
@@ -65,9 +95,9 @@ Content for subsection 1.1.
65
95
Content for section 2.
66
96
```
67
97
68
-
## Markdown one-to-many parsing mode (Markdown to Multiple Documents)
98
+
## Use one-to-many parsing mode
69
99
70
-
The **Markdown one-to-many parsing mode** parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into documents which contain the following content:
100
+
The one-to-many parsing mode parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into search documents which contain the following content:
71
101
72
102
- `content`: A string that contains the raw Markdown found in a specific location, based on the header metadata at that point in the document.
73
103
@@ -81,8 +111,7 @@ The **Markdown one-to-many parsing mode** parses Markdown files into multiple se
81
111
82
112
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each header.
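The fields described above can be sketched in Python. This is a simplified approximation, not the indexer's implementation: the function name `one_to_many`, the exact shape of the `sections` object, and the choice to count one ordinal position per emitted document are assumptions. The sample input is a hypothetical file modeled on the sample on this page.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def one_to_many(markdown: str) -> list[dict]:
    """Emit one document per content section, tagged with the headers in effect."""
    docs: list[dict] = []
    headers: dict[int, str] = {}  # header text per level, e.g. {1: "Section 1"}
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            docs.append({
                "content": text,
                # Assumption: only the header levels currently in effect are listed.
                "sections": {f"h{lvl}": name for lvl, name in sorted(headers.items())},
                "ordinal_position": len(docs) + 1,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces its own level and clears any deeper levels.
            for lvl in [k for k in headers if k >= level]:
                del headers[lvl]
            headers[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return docs


# Hypothetical sample with three content sections.
sample = (
    "# Section 1\nContent for section 1.\n\n"
    "## Subsection 1.1\nContent for subsection 1.1.\n\n"
    "# Section 2\nContent for section 2.\n"
)
docs = one_to_many(sample)
```

With this input the sketch yields three documents. The second one carries both `h1` and `h2` header metadata, and the third carries only the `h1` of the second top-level section.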
83
113
84
-
85
-
### Index schema for one-to-many parsed Markdown files
114
+
### Index schema for one-to-many parsing
86
115
87
116
An example index configuration might look something like this:
88
117
```http
@@ -118,7 +147,10 @@ An example index configuration might look something like this:
118
147
}
119
148
```
120
149
121
-
The blob indexer can infer the mapping without a field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
150
+
### Indexer definition for one-to-many parsing
151
+
152
+
If field names and data types align, the blob indexer can infer the mapping without an explicit field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
153
+
122
154
```http
123
155
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
124
156
Content-Type: application/json
@@ -137,7 +169,9 @@ api-key: [admin key]
137
169
> [!NOTE]
138
170
> The `submode` does not need to be set explicitly here because `oneToMany` is the default.
139
171
140
-
This Markdown file would result in three search documents after indexing, due to the three content sections. The search document resulting from the first content section of the provided Markdown document would contain the following values for `content`, `sections`, `h1`, and `h2`:
172
+
### Indexer output for one-to-many parsing
173
+
174
+
This Markdown file would result in three search documents after indexing, due to the three content sections. The search document resulting from the first content section of the provided Markdown document would contain the following values for `content`, `sections`, `h1`, and `h2`:
141
175
142
176
```http
143
177
{
@@ -168,7 +202,7 @@ api-key: [admin key]
168
202
}
169
203
```
170
204
171
-
## Map Markdown one-to-many fields to search fields
205
+
### Map one-to-many fields in a search index
172
206
173
207
Field mappings associate a source field with a destination field in situations where the field names and types aren't identical. But field mappings can also be used to match parts of a Markdown document and "lift" them into top-level fields of the search document.
174
208
@@ -207,13 +241,13 @@ The resulting search document in the index would look as follows:
207
241
208
242
<a name="parsing-markdown-one-to-one"></a>
209
243
210
-
## Markdown one-to-one parsing mode (Markdown to a single document)
244
+
## Use one-to-one parsing mode
211
245
212
-
In **Markdown one-to-one parsing mode**, the entire Markdown document is indexed as a single search document, preserving the hierarchy and structure of the original content. This mode is most useful when the files to be indexed share a common structure, so that you can leverage this common structure in the index to make the relevant fields searchable.
246
+
In the one-to-one parsing mode, the entire Markdown document is indexed as a single search document, preserving the hierarchy and structure of the original content. This mode is most useful when the files to be indexed share a common structure, so that you can use this common structure in the index to make the relevant fields searchable.
213
247
214
-
Within the indexer definition, set the `parsingMode` to "Markdown" and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
248
+
Within the indexer definition, set the `parsingMode` to `"markdown"` and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
215
249
216
-
The Markdown is parsed based on headers into documents which contain the following content:
250
+
The Markdown is parsed based on headers into search documents which contain the following content:
217
251
218
252
- `document_content`: Contains the full Markdown text as a single string. This field serves as a raw representation of the input document.
219
253
@@ -223,13 +257,13 @@ The Markdown is parsed based on headers into documents which contain the followi
223
257
224
258
- `header_name`: A string containing the text of the header as it appears in the Markdown document. This field provides a label or title for the section.
225
259
226
-
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there is no content directly under a header, this is an empty string.
260
+
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there's no content directly under a header, this is an empty string.
227
261
228
262
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each content block.
229
263
230
264
- `sections`: An array that contains objects representing subsections nested under the current section. This array follows the same structure as the top-level `sections` array, allowing for the representation of multiple levels of nested content. Each subsection object also includes `header_level`, `header_name`, `content`, and `ordinal_position` properties, enabling a recursive structure that represents the hierarchy of the Markdown content.
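The recursive structure described in this list can be approximated with a stack of open sections. The following Python sketch is illustrative only: the function name `one_to_one`, the `"h1"`-style string used for `header_level`, and the choice to count one ordinal position per header are assumptions, and the sample input is hypothetical.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def one_to_one(markdown: str) -> dict:
    """Build a single nested document from one Markdown string."""
    root = {"document_content": markdown, "sections": []}
    stack: list[dict] = []  # currently open sections, shallowest first
    ordinal = 0
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level = len(match.group(1))
            ordinal += 1
            node = {
                "header_level": f"h{level}",
                "header_name": match.group(2),
                "content": "",
                "ordinal_position": ordinal,
                "sections": [],
            }
            # Close every open section at the same or a deeper level.
            while stack and int(stack[-1]["header_level"][1:]) >= level:
                stack.pop()
            parent = stack[-1] if stack else root
            parent["sections"].append(node)
            stack.append(node)
        elif stack:
            # Text before the first header is ignored in this sketch.
            stack[-1]["content"] += line + "\n"
    return root


# Hypothetical sample: two top-level sections, one nested subsection.
sample = (
    "# Section 1\nContent for section 1.\n\n"
    "## Subsection 1.1\nContent for subsection 1.1.\n\n"
    "# Section 2\nContent for section 2.\n"
)
doc = one_to_one(sample)
```

Because a header attaches to the nearest shallower open section, or to the root when there is none, a file that begins at `h2` places that `h2` directly in the top-level `sections` array in this sketch.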
231
265
232
-
Consider the following Markdown content. We use this content to explain an index schema that's designed around it, and what the search documents might look like for each parsing mode.
266
+
Here's the sample Markdown that we're using to explain an index schema that's designed around each parsing mode.
233
267
234
268
```md
235
269
# Section 1
@@ -242,8 +276,9 @@ Content for subsection 1.1.
242
276
Content for section 2.
243
277
```
244
278
245
-
### Index schema for one-to-one parsed Markdown files
246
-
If you are not utilizing field mappings, the shape of the index should reflect the shape of the Markdown content. Based on the previous Markdown, the index should look similar to the following example:
279
+
### Index schema for one-to-one parsing
280
+
281
+
If you aren't using field mappings, the shape of the index should reflect the shape of the Markdown content. Given the structure of the sample Markdown, with its two sections and single subsection, the index should look similar to the following example:
247
282
```http
248
283
{
249
284
"name": "my-markdown-index",
@@ -296,6 +331,28 @@ If you are not utilizing field mappings, the shape of the index should reflect t
296
331
}
297
332
```
298
333
334
+
### Indexer definition for one-to-one parsing
335
+
336
+
```http
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
Content-Type: application/json
api-key: [admin key]

{
  "name": "my-markdown-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-target-index",
  "parameters": {
    "configuration": {
      "parsingMode": "markdown",
      "markdownParsingSubmode": "oneToOne"
    }
  }
}
```
353
+
354
+
### Indexer output for one-to-one parsing
355
+
299
356
Because the Markdown we want to index only goes to a depth of `h2` ("##"), we need `sections` fields nested to a depth of 2 to match that. This configuration would result in the following data in the index:
300
357
301
358
```http
@@ -328,24 +385,8 @@ As you can see, the ordinal position increments based on the location of the con
328
385
329
386
If header levels are skipped in the content, the structure of the resulting document reflects the headers that are present in the Markdown content, and doesn't necessarily contain nested sections for `h1` through `h6` consecutively. For example, when the document begins at `h2`, the first element in the top-level sections array is `h2`.
330
387
331
-
```http
332
-
POST https://[service name].search.windows.net/indexers?api-version=2024-07-01
333
-
Content-Type: application/json
334
-
api-key: [admin key]
388
+
### Map one-to-one fields in a search index
335
389
336
-
{
337
-
"name": "my-markdown-indexer",
338
-
"dataSourceName": "my-blob-datasource",
339
-
"targetIndexName": "my-target-index",
340
-
"parameters": {
341
-
"configuration": {
342
-
"parsingMode": "markdown",
343
-
"markdownParsingSubmode": "oneToMany",
344
-
}
345
-
}
346
-
}
347
-
```
348
-
## Map Markdown one-to-one fields to search fields
349
390
If you would like to extract fields with custom names from the document, you can use field mappings to do so. Using the same Markdown sample as before, consider the following index configuration:
350
391
351
392
```http
@@ -372,7 +413,7 @@ If you would like to extract fields with custom names from the document, you can
372
413
}
373
414
```
374
415
375
-
Extracting specific fields from the parsed Markdown is handled similar to how the document paths are in (outputFieldMappings)[https://learn.microsoft.com/en-us/azure/search/cognitive-search-output-field-mapping?tabs=rest], except the path begins with `/sections` instead of `/document`. So, for example, `/sections/0/content` would map to the content under the item at position 0 in the sections array.
416
+
Extracting specific fields from the parsed Markdown works similarly to document paths in [outputFieldMappings](cognitive-search-output-field-mapping.md), except the path begins with `/sections` instead of `/document`. So, for example, `/sections/0/content` would map to the content under the item at position 0 in the sections array.
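To make the path syntax concrete, here is a small client-side Python sketch that walks a path such as `/sections/0/content` through a parsed document. The helper `resolve_path` and the sample dictionary are hypothetical. The service resolves these paths itself, so this only illustrates how numeric segments index into arrays and named segments select properties.

```python
def resolve_path(document: dict, path: str):
    """Follow a slash-delimited path; numeric segments index into lists."""
    node = document
    for part in path.strip("/").split("/"):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Hypothetical parsed document, shaped like one-to-one parsing output.
parsed = {
    "sections": [
        {
            "header_name": "Section 1",
            "content": "Content for section 1.",
            "sections": [
                {
                    "header_name": "Subsection 1.1",
                    "content": "Content for subsection 1.1.",
                    "sections": [],
                }
            ],
        }
    ]
}

top_content = resolve_path(parsed, "/sections/0/content")
sub_title = resolve_path(parsed, "/sections/0/sections/0/header_name")
```

Here `top_content` is the text under the first top-level header, and `sub_title` is the name of its first nested subsection.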
376
417
377
418
An example of a strong use case might look something like this: all Markdown files have a document title in the first `h1`, a subsection title in the first `h2`, and a summary in the content of the final paragraph underneath the final `h1`. You could use the following field mappings to index only that content: