In Azure AI Search, indexers for Azure Blob Storage, Azure Files, and OneLake support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
+ One-to-one parsing mode, creating one search document per Markdown file
## Prerequisites

-+ A supported data source. For OneLake, make sure you meet all of the requirements of the [OneLake indexer](search-how-to-index-onelake-files#prerequisites). Azure Storage is a standard performance (general-purpose v2) instance that supports hot, cool, and cold access tiers.
++ A supported data source: Azure Blob storage, Azure File storage, OneLake in Microsoft Fabric.
+
+  For OneLake, make sure you meet all of the requirements of the [OneLake indexer](search-how-to-index-onelake-files.md#prerequisites).
+
+  Azure Storage for [blob indexers](search-howto-indexing-azure-blob-storage.md#prerequisites) and [file indexers](search-file-storage-integration.md#prerequisites) is a standard performance (general-purpose v2) instance that supports hot, cool, and cold access tiers.

## Markdown parsing mode parameters
@@ -109,7 +111,7 @@ The one-to-many parsing mode parses Markdown files into multiple search document
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each header.
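
To make the `ordinal_position` numbering concrete, here is a minimal Python sketch of how a one-to-many split could assign it. It is an illustration only, not the indexer's implementation: the function name `split_one_to_many`, the sample header text, and the choice to skip headers that have no body text are all assumptions, and the sketch ignores edge cases such as `#` lines inside fenced code.

```python
import re

# ATX-style Markdown header: one to six '#' characters, whitespace, then the title.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_one_to_many(markdown: str) -> list[dict]:
    """Split Markdown into one dict per content section (hypothetical helper)."""
    docs: list[dict] = []
    headers: dict[str, str] = {}  # header text currently in scope, keyed "h1".."h6"
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:  # assumption: a header with no body text produces no document
            docs.append({
                "content": text,
                "sections": dict(headers),
                "ordinal_position": len(docs) + 1,  # numbering starts at 1
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header ends every header at the same or a deeper level.
            for key in [k for k in headers if int(k[1]) >= level]:
                del headers[key]
            headers[f"h{level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return docs


if __name__ == "__main__":
    sample = (
        "# Section 1\nContent for section 1.\n\n"
        "## Subsection 1.1\nContent for subsection 1.1.\n\n"
        "# Section 2\nContent for section 2.\n"
    )
    for doc in split_one_to_many(sample):
        print(doc["ordinal_position"], doc["sections"], doc["content"])
```

With a sample of two sections and one subsection, the sketch yields three dictionaries whose `ordinal_position` values follow the order of the sections in the file.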

-### Index schema for one-to-many parsed Markdown files
+### Index schema for one-to-many parsing

An example index configuration might look something like this:
```http
@@ -145,6 +147,8 @@ An example index configuration might look something like this:
}
```

+### Indexer definition for one-to-many parsing
+
If field names and data types align, the blob indexer can infer the mapping without an explicit field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
```http
@@ -165,7 +169,9 @@ api-key: [admin key]
> [!NOTE]
> The `submode` does not need to be set explicitly here because `oneToMany` is the default.

-This Markdown file would result in three search documents after indexing, due to the three content sections. The search document resulting from the first content section of the provided Markdown document would contain the following values for `content`, `sections`, `h1`, and `h2`:
+### Indexer output for one-to-many parsing
+
+This Markdown file would result in three search documents after indexing, due to the three content sections. The search document resulting from the first content section of the provided Markdown document would contain the following values for `content`, `sections`, `h1`, and `h2`:

```http
{
@@ -270,7 +276,7 @@ Content for subsection 1.1.
Content for section 2.
```

-### Index schema for one-to-one parsed Markdown files
+### Index schema for one-to-one parsing

If you aren't utilizing field mappings, the shape of the index should reflect the shape of the Markdown content. Given the structure of sample Markdown with its two sections and single subsection, the index should look similar to the following example:
```http
@@ -325,6 +331,28 @@ If you aren't utilizing field mappings, the shape of the index should reflect th
}
```

+### Indexer definition for one-to-one parsing
+
+```http
+POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
+Content-Type: application/json
+api-key: [admin key]
+
+{
+  "name": "my-markdown-indexer",
+  "dataSourceName": "my-blob-datasource",
+  "targetIndexName": "my-target-index",
+  "parameters": {
+    "configuration": {
+      "parsingMode": "markdown",
+      "markdownParsingSubmode": "oneToMany",
+    }
+  }
+}
+```
+
+### Indexer output for one-to-one parsing
+
Because the Markdown we want to index only goes to a depth of `h2` ("##"), we need `sections` fields nested to a depth of 2 to match that. This configuration would result in the following data in the index:
```http
@@ -357,24 +385,6 @@ As you can see, the ordinal position increments based on the location of the con
If header levels are skipped in the content, the structure of the resulting document reflects only the headers that are present in the Markdown content; it doesn't necessarily contain nested sections for `h1` through `h6` consecutively. For example, when the document begins at `h2`, the first element in the top-level sections array is `h2`.

-```http
-POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
-Content-Type: application/json
-api-key: [admin key]
-
-{
-  "name": "my-markdown-indexer",
-  "dataSourceName": "my-blob-datasource",
-  "targetIndexName": "my-target-index",
-  "parameters": {
-    "configuration": {
-      "parsingMode": "markdown",
-      "markdownParsingSubmode": "oneToMany",
-    }
-  }
-}
-```
-
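
The point about skipped header levels can be illustrated with a small Python sketch that nests sections using only the header levels actually present in the file. This is a rough model under stated assumptions, not the service's parser: the helper name `nest_sections` and the field names `header_level`, `header_name`, and `content` are chosen for illustration, text before the first header is ignored, and `#` lines inside fenced code are not handled.

```python
import re

# ATX-style Markdown header: one to six '#' characters, whitespace, then the title.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def nest_sections(markdown: str) -> list[dict]:
    """Build a nested section tree from Markdown headers (hypothetical helper)."""
    root = {"sections": []}
    stack = [(0, root)]  # (header level, node); level 0 is the document root
    ordinal = 0

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            level = len(match.group(1))
            ordinal += 1  # position follows the order of headers in the file
            node = {
                "header_level": f"h{level}",
                "header_name": match.group(2).strip(),
                "content": "",
                "ordinal_position": ordinal,
                "sections": [],
            }
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            # Attach to the nearest shallower header, whatever its level is,
            # so an h4 directly under an h2 nests without an h3 in between.
            stack[-1][1]["sections"].append(node)
            stack.append((level, node))
        elif stack[-1][0] > 0 and line.strip():
            current = stack[-1][1]
            current["content"] = (current["content"] + "\n" + line).strip()

    return root["sections"]


if __name__ == "__main__":
    sample = "## Intro\nText.\n#### Details\nMore.\n## Next\nEnd.\n"
    for section in nest_sections(sample):
        print(section["header_level"], section["header_name"],
              [child["header_level"] for child in section["sections"]])
```

For a file that starts at `h2` and jumps straight to `h4`, the top-level list contains only `h2` entries and the `h4` section hangs directly under its `h2` parent.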
## Map one-to-one fields to search fields
If you would like to extract fields with custom names from the document, you can use field mappings to do so. Using the same Markdown sample as before, consider the following index configuration: