In Azure AI Search, indexers for Azure Blob Storage and Azure Files support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
22
+
In Azure AI Search, indexers for Azure Blob Storage, Azure Files, and OneLake support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
+ One-to-one parsing mode, creating one search document per Markdown file
26
+
27
+
## Prerequisites
28
+
29
+
+ A supported data source. For OneLake, make sure you meet all of the requirements of the [OneLake indexer](search-how-to-index-onelake-files.md#prerequisites). For Azure Storage, use a standard performance (general-purpose v2) account that supports hot, cool, and cold access tiers.
30
+
31
+
## Markdown parsing mode parameters
32
+
33
+
Parsing mode parameters are specified in an indexer definition when you create or update an indexer.
34
+
35
+
```http
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
Content-Type: application/json
api-key: [admin key]

{
  "name": "my-markdown-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-target-index",
  "parameters": {
    "configuration": {
      "parsingMode": "markdown",
      "markdownParsingSubmode": "oneToMany",
      "markdownHeaderDepth": "h6"
    }
  }
}
```
26
53
27
54
The blob indexer provides a `submode` parameter to determine the structure of the search documents. Markdown parsing mode provides the following submode options:
|**`markdown`**|**`oneToMany`**| Multiple per blob | (default) Breaks the Markdown into multiple search documents, each representing a content (nonheader) section of the Markdown file. |
58
+
|**`markdown`**|**`oneToMany`**| Multiple per blob | (default) Breaks the Markdown into multiple search documents, each representing a content (nonheader) section of the Markdown file. You can omit submode unless you want one-to-one parsing.|
32
59
|**`markdown`**|**`oneToOne`**| One per blob | Parses the Markdown into one search document, with sections mapped to specific headers in the Markdown file.|
33
60
34
61
For **`oneToMany`** submode, you should review [Indexing one blob to produce many search documents](search-howto-index-one-to-many-blobs.md) to understand how the blob indexer handles disambiguation of the document key for multiple search documents produced from the same blob.
35
62
36
63
Later sections describe each submode in more detail. If you're unfamiliar with indexer clients and concepts, see [Create a search indexer](search-howto-create-indexers.md). You should also be familiar with the details of [basic blob indexer configuration](search-howto-indexing-azure-blob-storage.md), which isn't repeated here.
37
64
38
-
## Additional Markdown parsing parameters
65
+
### Optional Markdown parsing parameters
39
66
40
67
Parameters are case-sensitive.
41
68
@@ -46,13 +73,14 @@ Parameters are case-sensitive.
46
73
This setting can be changed after initial creation of the indexer. However, the structure of the resulting search documents might change depending on the Markdown content.
47
74
48
75
## Supported Markdown elements
49
-
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, etc., are treated as plaintext.
76
+
77
+
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, and so forth, are treated as plain text and passed into a content field.
50
78
51
79
<a name="parsing-markdown-one-to-many"></a>
52
80
53
81
## Sample Markdown content
54
82
55
-
The following Markdown content will be used for the examples on this page:
83
+
The following Markdown content is used for the examples on this page:
56
84
57
85
```md
58
86
# Section 1
@@ -65,9 +93,9 @@ Content for subsection 1.1.
65
93
Content for section 2.
66
94
```
67
95
68
-
## Markdown one-to-many parsing mode (Markdown to Multiple Documents)
96
+
## Use one-to-many parsing mode
69
97
70
-
The **Markdown one-to-many parsing mode** parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into documents which contain the following content:
98
+
The one-to-many parsing mode parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into search documents that contain the following content:
71
99
72
100
- `content`: A string that contains the raw Markdown found in a specific location, based on the header metadata at that point in the document.
73
101
@@ -81,7 +109,6 @@ The **Markdown one-to-many parsing mode** parses Markdown files into multiple se
81
109
82
110
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each header.
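
The one-to-many behavior described above can be approximated locally with a short Python sketch. It is not the indexer's implementation. The function name `parse_one_to_many`, the `max_depth` argument (standing in for `markdownHeaderDepth`), the `sections` key holding `h1` through `h6` header metadata, and the sample text are all assumptions made for illustration. The sketch also numbers `ordinal_position` per emitted document and doesn't treat `#` lines inside fenced code blocks specially.

```python
import re

# Illustrative sample (an assumption; the page's own sample is only partly shown).
SAMPLE = """# Section 1
Content for section 1.

## Subsection 1.1
Content for subsection 1.1.

# Section 2
Content for section 2.
"""

# ATX-style header: one to six '#' characters followed by whitespace and text.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def parse_one_to_many(markdown, max_depth=6):
    """Return one dict per content (nonheader) section of the Markdown text."""
    docs = []
    headers = {}  # current header text per level, for example {1: "Section 1"}
    buffer = []   # content lines collected since the last splitting header

    def flush():
        text = "\n".join(buffer).strip()
        buffer.clear()
        if text:
            docs.append({
                "content": text,
                # Hypothetical shape for the header metadata of this section.
                "sections": {f"h{lvl}": name for lvl, name in sorted(headers.items())},
                "ordinal_position": len(docs) + 1,
            })

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match and len(match.group(1)) <= max_depth:
            flush()
            level = len(match.group(1))
            # A new header replaces headers at the same level or deeper.
            for stale in [lvl for lvl in headers if lvl >= level]:
                del headers[stale]
            headers[level] = match.group(2).strip()
        else:
            # Headers deeper than max_depth are kept as plain content.
            buffer.append(line)
    flush()
    return docs


if __name__ == "__main__":
    for doc in parse_one_to_many(SAMPLE):
        print(doc)
```

Calling `parse_one_to_many(SAMPLE, max_depth=1)` keeps the `h2` line inside the content of the first document, which is one way to picture how a shallower header depth produces fewer, larger search documents.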
83
111
84
-
85
112
### Index schema for one-to-many parsed Markdown files
86
113
87
114
An example index configuration might look something like this:
@@ -118,7 +145,8 @@ An example index configuration might look something like this:
118
145
}
119
146
```
120
147
121
-
The blob indexer can infer the mapping without a field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
148
+
If field names and data types align, the blob indexer can infer the mapping without an explicit field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
149
+
122
150
```http
123
151
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
124
152
Content-Type: application/json
@@ -168,7 +196,7 @@ api-key: [admin key]
168
196
}
169
197
```
170
198
171
-
## Map Markdown one-to-many fields to search fields
199
+
## Map one-to-many fields to search fields
172
200
173
201
Field mappings associate a source field with a destination field in situations where the field names and types aren't identical. But field mappings can also be used to match parts of a Markdown document and "lift" them into top-level fields of the search document.
174
202
@@ -207,13 +235,13 @@ The resulting search document in the index would look as follows:
207
235
208
236
<a name="parsing-markdown-one-to-one"></a>
209
237
210
-
## Markdown one-to-one parsing mode (Markdown to a single document)
238
+
## Use one-to-one parsing mode
211
239
212
-
In **Markdown one-to-one parsing mode**, the entire Markdown document is indexed as a single search document, preserving the hierarchy and structure of the original content. This mode is most useful when the files to be indexed share a common structure, so that you can leverage this common structure in the index to make the relevant fields searchable.
240
+
In the one-to-one parsing mode, the entire Markdown document is indexed as a single search document, preserving the hierarchy and structure of the original content. This mode is most useful when the files to be indexed share a common structure, so that you can use this common structure in the index to make the relevant fields searchable.
213
241
214
-
Within the indexer definition, set the `parsingMode` to "Markdown" and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
242
+
Within the indexer definition, set the `parsingMode` to `"markdown"` and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
215
243
216
-
The Markdown is parsed based on headers into documents which contain the following content:
244
+
The Markdown is parsed based on headers into search documents that contain the following content:
217
245
218
246
- `document_content`: Contains the full Markdown text as a single string. This field serves as a raw representation of the input document.
219
247
@@ -223,13 +251,13 @@ The Markdown is parsed based on headers into documents which contain the followi
223
251
224
252
- `header_name`: A string containing the text of the header as it appears in the Markdown document. This field provides a label or title for the section.
225
253
226
-
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there is no content directly under a header, this is an empty string.
254
+
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there's no content directly under a header, this is an empty string.
227
255
228
256
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each content block.
229
257
230
258
- `sections`: An array that contains objects representing subsections nested under the current section. This array follows the same structure as the top-level `sections` array, allowing for the representation of multiple levels of nested content. Each subsection object also includes `header_level`, `header_name`, `content`, and `ordinal_position` properties, enabling a recursive structure that represents the hierarchy of the Markdown content.
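
The recursive `sections` structure listed above can be pictured with a small Python sketch that nests headers by level. This is an approximation under stated assumptions, not the indexer's code: the function name `parse_one_to_one`, the `"h1"`-style values for `header_level`, the sample text, and the choice to number `ordinal_position` once per header section are illustrative. Text that appears before the first header is ignored here for brevity.

```python
import re

# Illustrative sample (an assumption; the page's own sample is only partly shown).
SAMPLE = """# Section 1
Content for section 1.

## Subsection 1.1
Content for subsection 1.1.

# Section 2
Content for section 2.
"""

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def _strip_content(sections):
    """Trim surrounding whitespace from every content field, recursively."""
    for section in sections:
        section["content"] = section["content"].strip()
        _strip_content(section["sections"])


def parse_one_to_one(markdown):
    """Return a single nested dict that mirrors the header hierarchy."""
    root = {"document_content": markdown, "sections": []}
    stack = [(0, root)]  # (header level, node); level 0 is the document itself
    ordinal = 0
    current = None
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            level = len(match.group(1))
            # Climb back up until the parent is shallower than this header.
            while stack[-1][0] >= level:
                stack.pop()
            ordinal += 1
            current = {
                "header_level": f"h{level}",  # assumed value format
                "header_name": match.group(2).strip(),
                "content": "",
                "ordinal_position": ordinal,
                "sections": [],
            }
            stack[-1][1]["sections"].append(current)
            stack.append((level, current))
        elif current is not None:
            current["content"] += line + "\n"
    _strip_content(root["sections"])
    return root


if __name__ == "__main__":
    import json
    print(json.dumps(parse_one_to_one(SAMPLE), indent=2))
```

Because a header attaches to the nearest shallower header still open, a file that starts at `h2` produces a top-level `sections` entry whose `header_level` is `h2` in this sketch, rather than an empty `h1` wrapper.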
231
259
232
-
Consider the following Markdown content. We use this content to explain an index schema that's designed around it, and what the search documents might look like for each parsing mode.
260
+
Here's the sample Markdown that we're using to explain an index schema that's designed around each parsing mode.
233
261
234
262
```md
235
263
# Section 1
@@ -242,8 +270,9 @@ Content for subsection 1.1.
242
270
Content for section 2.
243
271
```
244
272
245
-
### Index schema for one-to-one parsed Markdown files
246
-
If you are not utilizing field mappings, the shape of the index should reflect the shape of the Markdown content. Based on the previous Markdown, the index should look similar to the following example:
273
+
### Index schema for one-to-one parsed Markdown files
274
+
275
+
If you aren't using field mappings, the shape of the index should reflect the shape of the Markdown content. Given the structure of the sample Markdown, with its two sections and single subsection, the index should look similar to the following example:
247
276
```http
248
277
{
249
278
"name": "my-markdown-index",
@@ -329,7 +358,7 @@ As you can see, the ordinal position increments based on the location of the con
329
358
If header levels are skipped in the content, the structure of the resulting document reflects the headers that are present in the Markdown content and doesn't necessarily contain nested sections for `h1` through `h6` consecutively. For example, when the document begins at `h2`, the first element in the top-level sections array is `h2`.
330
359
331
360
```http
332
-
POST https://[service name].search.windows.net/indexers?api-version=2024-07-01
361
+
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
333
362
Content-Type: application/json
334
363
api-key: [admin key]
335
364
@@ -345,7 +374,9 @@ api-key: [admin key]
345
374
}
346
375
}
347
376
```
348
-
## Map Markdown one-to-one fields to search fields
377
+
378
+
## Map one-to-one fields to search fields
379
+
349
380
If you would like to extract fields with custom names from the document, you can use field mappings to do so. Using the same Markdown sample as before, consider the following index configuration:
350
381
351
382
```http
@@ -372,7 +403,7 @@ If you would like to extract fields with custom names from the document, you can
372
403
}
373
404
```
374
405
375
-
Extracting specific fields from the parsed Markdown is handled similar to how the document paths are in (outputFieldMappings)[https://learn.microsoft.com/en-us/azure/search/cognitive-search-output-field-mapping?tabs=rest], except the path begins with `/sections` instead of `/document`. So, for example, `/sections/0/content` would map to the content under the item at position 0 in the sections array.
406
+
Extracting specific fields from the parsed Markdown works similarly to document paths in [outputFieldMappings](cognitive-search-output-field-mapping.md), except that the path begins with `/sections` instead of `/document`. For example, `/sections/0/content` maps to the content under the item at position 0 in the sections array.
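
To see how a path such as `/sections/0/content` selects a value, the following Python sketch walks a nested dictionary shaped like a one-to-one parsed document. The helper name `resolve_path` and the sample `parsed` document are hypothetical, and the real service might resolve paths differently. The sketch only shows the idea of alternating field names and array positions.

```python
def resolve_path(document, path):
    """Follow a slash-separated path through nested dicts and lists."""
    node = document
    for token in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(token)]  # numeric token: position in an array
        else:
            node = node[token]       # other tokens: field name
    return node


# Hypothetical parsed document with one section and one nested subsection.
parsed = {
    "sections": [
        {
            "header_name": "Section 1",
            "content": "Content for section 1.",
            "sections": [
                {
                    "header_name": "Subsection 1.1",
                    "content": "Content for subsection 1.1.",
                    "sections": [],
                }
            ],
        }
    ]
}

if __name__ == "__main__":
    print(resolve_path(parsed, "/sections/0/content"))
    print(resolve_path(parsed, "/sections/0/sections/0/header_name"))
```

A path that points at a missing position or field raises `IndexError` or `KeyError` in this sketch, which makes it easy to check paths against a sample file before writing field mappings.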
376
407
377
408
An example of a strong use case might look something like this: all Markdown files have a document title in the first `h1`, a subsection title in the first `h2`, and a summary in the content of the final paragraph underneath the final `h1`. You could use the following field mappings to index only that content: