Commit c50d2bd

committed
template pass over markdown parsing howto
1 parent b4e2eb2 commit c50d2bd

1 file changed

+54
-23
lines changed

articles/search/search-how-to-index-markdown-blobs.md

Lines changed: 54 additions & 23 deletions
@@ -19,23 +19,50 @@ ms.date: 11/19/2024
1919

2020
**Applies to**: [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), [File indexers](search-file-storage-integration.md)
2121

22-
In Azure AI Search, indexers for Azure Blob Storage and Azure Files support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
22+
In Azure AI Search, indexers for Azure Blob Storage, Azure Files, and OneLake support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
2323

24-
+ One-to-many parsing mode
25-
+ One-to-one parsing mode
24+
+ One-to-many parsing mode, creating multiple search documents per Markdown file
25+
+ One-to-one parsing mode, creating one search document per Markdown file
26+
27+
## Prerequisites
28+
29+
+ A supported data source. For OneLake, make sure you meet all of the requirements of the [OneLake indexer](search-how-to-index-onelake-files.md#prerequisites). For Azure Storage, use a standard performance (general-purpose v2) account that supports hot, cool, and cold access tiers.
30+
31+
## Markdown parsing mode parameters
32+
33+
Parsing mode parameters are specified in an indexer definition when you create or update an indexer.
34+
35+
```http
36+
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
37+
Content-Type: application/json
38+
api-key: [admin key]
39+
40+
{
41+
"name": "my-markdown-indexer",
42+
"dataSourceName": "my-blob-datasource",
43+
"targetIndexName": "my-target-index",
44+
"parameters": {
45+
"configuration": {
46+
"parsingMode": "markdown",
47+
"markdownParsingSubmode": "oneToMany",
48+
"markdownHeaderDepth": "h6"
49+
}
50+
}
51+
}
52+
```
2653

2754
The blob indexer provides a `submode` parameter to determine the output structure of the search documents. Markdown parsing mode provides the following submode options:
2855

2956
| parsingMode | submode | Search document | Description |
3057
|--------------|-------------|-------------|--------------|
31-
| **`markdown`** | **`oneToMany`** | Multiple per blob | (default) Breaks the Markdown into multiple search documents, each representing a content (nonheader) section of the Markdown file. |
58+
| **`markdown`** | **`oneToMany`** | Multiple per blob | (default) Breaks the Markdown into multiple search documents, each representing a content (nonheader) section of the Markdown file. You can omit submode unless you want one-to-one parsing.|
3259
| **`markdown`** | **`oneToOne`** | One per blob | Parses the Markdown into one search document, with sections mapped to specific headers in the Markdown file.|
3360

3461
For **`oneToMany`** submode, you should review [Indexing one blob to produce many search documents](search-howto-index-one-to-many-blobs.md) to understand how the blob indexer handles disambiguation of the document key for multiple search documents produced from the same blob.
3562

3663
Later sections describe each submode in more detail. If you're unfamiliar with indexer clients and concepts, see [Create a search indexer](search-howto-create-indexers.md). You should also be familiar with the details of [basic blob indexer configuration](search-howto-indexing-azure-blob-storage.md), which isn't repeated here.
3764

38-
## Additional Markdown parsing parameters
65+
### Optional Markdown parsing parameters
3966

4067
Parameters are case-sensitive.
4168

@@ -46,13 +73,14 @@ Parameters are case-sensitive.
4673
This setting can be changed after initial creation of the indexer. However, the structure of the resulting search documents might change depending on the Markdown content.
4774

4875
## Supported Markdown elements
49-
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, etc., are treated as plaintext.
76+
77+
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, and so forth, are treated as plain text and passed into a content field.
5078

5179
<a name="parsing-markdown-one-to-many"></a>
5280

5381
## Sample Markdown content
5482

55-
The following Markdown content will be used for the examples on this page:
83+
The following Markdown content is used for the examples on this page:
5684

5785
```md
5886
# Section 1
@@ -65,9 +93,9 @@ Content for subsection 1.1.
6593
Content for section 2.
6694
```
6795

68-
## Markdown one-to-many parsing mode (Markdown to Multiple Documents)
96+
## Use one-to-many parsing mode
6997

70-
The **Markdown one-to-many parsing mode** parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into documents which contain the following content:
98+
The one-to-many parsing mode parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into search documents which contain the following content:
7199

72100
- `content`: A string that contains the raw Markdown found in a specific location, based on the header metadata at that point in the document.
73101

@@ -81,7 +109,6 @@ The **Markdown one-to-many parsing mode** parses Markdown files into multiple se
81109

82110
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each header.
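
To make the header-based splitting concrete, here's a minimal Python sketch of the idea. It isn't the indexer's implementation. The function name `split_one_to_many`, the `max_depth` argument, the `h1`-style keys inside `sections`, and the choice to count only sections that have content are all assumptions made for illustration.

```python
import re

# ATX-style headers only ("# Title" through "###### Title").
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")

SAMPLE = """# Section 1
Content for section 1.

## Subsection 1.1
Content for subsection 1.1.

# Section 2
Content for section 2.
"""


def split_one_to_many(markdown: str, max_depth: int = 6) -> list[dict]:
    """Return one record per content section (illustrative sketch only)."""
    docs = []
    trail = {}    # header text per level at the current position, e.g. {"h1": "Section 1"}
    buffer = []   # content lines collected since the last header
    ordinal = 0

    def flush():
        nonlocal ordinal
        text = "\n".join(buffer).strip()
        buffer.clear()
        if text:  # assumption: headers without content produce no record
            ordinal += 1
            docs.append({
                "content": text,
                "sections": dict(trail),
                "ordinal_position": ordinal,
            })

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_depth:
            flush()
            level = len(match.group(1))
            # A new header replaces any deeper headers recorded earlier.
            trail = {k: v for k, v in trail.items() if int(k[1]) < level}
            trail[f"h{level}"] = match.group(2).strip()
        else:
            # Lists, tables, and code are kept as plain text in the content.
            buffer.append(line)
    flush()
    return docs


for doc in split_one_to_many(SAMPLE):
    print(doc["ordinal_position"], doc["sections"], repr(doc["content"]))
```

The sketch doesn't special-case fenced code blocks, so a line that starts with `#` inside a code fence would be treated as a header. Treat it as a mental model only, not as a description of how the service parses files.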
83111

84-
85112
### Index schema for one-to-many parsed Markdown files
86113

87114
An example index configuration might look something like this:
@@ -118,7 +145,8 @@ An example index configuration might look something like this:
118145
}
119146
```
120147

121-
The blob indexer can infer the mapping without a field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
148+
If field names and data types align, the blob indexer can infer the mapping without an explicit field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
149+
122150
```http
123151
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
124152
Content-Type: application/json
@@ -168,7 +196,7 @@ api-key: [admin key]
168196
}
169197
```
170198

171-
## Map Markdown one-to-many fields to search fields
199+
## Map one-to-many fields to search fields
172200

173201
Field mappings associate a source field with a destination field in situations where the field names and types aren't identical. But field mappings can also be used to match parts of a Markdown document and "lift" them into top-level fields of the search document.
174202

@@ -207,13 +235,13 @@ The resulting search document in the index would look as follows:
207235

208236
<a name="parsing-markdown-one-to-one"></a>
209237

210-
## Markdown one-to-one parsing mode (Markdown to a single document)
238+
## Use one-to-one parsing mode
211239

212-
In **Markdown one-to-one parsing mode**, the entire Markdown document is indexed as a single search document, preserving the hierarchy and structure of the original content. This mode is most useful when the files to be indexed share a common structure, so that you can leverage this common structure in the index to make the relevant fields searchable.
240+
In the one-to-one parsing mode, the entire Markdown document is indexed as a single search document, preserving the hierarchy and structure of the original content. This mode is most useful when the files to be indexed share a common structure, so that you can use this common structure in the index to make the relevant fields searchable.
213241

214-
Within the indexer definition, set the `parsingMode` to "Markdown" and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
242+
Within the indexer definition, set the `parsingMode` to `"markdown"` and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
215243

216-
The Markdown is parsed based on headers into documents which contain the following content:
244+
The Markdown is parsed based on headers into search documents which contain the following content:
217245

218246
- `document_content`: Contains the full Markdown text as a single string. This field serves as a raw representation of the input document.
219247

@@ -223,13 +251,13 @@ The Markdown is parsed based on headers into documents which contain the followi
223251

224252
- `header_name`: A string containing the text of the header as it appears in the Markdown document. This field provides a label or title for the section.
225253

226-
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there is no content directly under a header, this is an empty string.
254+
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there's no content directly under a header, this is an empty string.
227255

228256
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each content block.
229257

230258
- `sections`: An array that contains objects representing subsections nested under the current section. This array follows the same structure as the top-level `sections` array, allowing for the representation of multiple levels of nested content. Each subsection object also includes `header_level`, `header_name`, `content`, and `ordinal_position` properties, enabling a recursive structure that represents the hierarchy of the Markdown content.
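
The recursive shape described in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the service's parser: the function name `parse_one_to_one` is hypothetical, `header_level` values are assumed to look like `h1`, and the ordinal is assumed to increase once per header.

```python
import re

# ATX-style headers only ("# Title" through "###### Title").
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")

SAMPLE = """# Section 1
Content for section 1.

## Subsection 1.1
Content for subsection 1.1.

# Section 2
Content for section 2.
"""


def parse_one_to_one(markdown: str) -> dict:
    """Build one nested record for the whole file (illustrative sketch only)."""
    root = {"document_content": markdown, "sections": []}
    stack = []   # open headers as (level, section, content_lines)
    ordinal = 0

    def close(entry):
        _, section, lines = entry
        section["content"] = "\n".join(lines).strip()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            # Text before the first header stays only in document_content.
            if stack:
                stack[-1][2].append(line)
            continue
        level = len(match.group(1))
        ordinal += 1
        section = {
            "header_level": f"h{level}",
            "header_name": match.group(2).strip(),
            "content": "",
            "ordinal_position": ordinal,
            "sections": [],
        }
        # Close headers at the same or a deeper level before attaching.
        while stack and stack[-1][0] >= level:
            close(stack.pop())
        parent = stack[-1][1] if stack else root
        parent["sections"].append(section)
        stack.append((level, section, []))
    while stack:
        close(stack.pop())
    return root


parsed = parse_one_to_one(SAMPLE)
print(parsed["sections"][0]["sections"][0]["header_name"])
```

Because a section attaches to the nearest open header above it, a file that starts at `h2` yields a top-level element whose `header_level` is `h2`, with no empty `h1` wrapper.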
231259

232-
Consider the following Markdown content. We use this content to explain an index schema that's designed around it, and what the search documents might look like for each parsing mode.
260+
Here's the sample Markdown that we're using to explain an index schema that's designed around each parsing mode.
233261

234262
```md
235263
# Section 1
@@ -242,8 +270,9 @@ Content for subsection 1.1.
242270
Content for section 2.
243271
```
244272

245-
### Index schema for one-to-one parsed Markdown files
246-
If you are not utilizing field mappings, the shape of the index should reflect the shape of the Markdown content. Based on the previous Markdown, the index should look similar to the following example:
273+
### Index schema for one-to-one parsed Markdown files
274+
275+
If you aren't using field mappings, the shape of the index should reflect the shape of the Markdown content. Given the structure of the sample Markdown, with its two sections and single subsection, the index should look similar to the following example:
247276
```http
248277
{
249278
"name": "my-markdown-index",
@@ -329,7 +358,7 @@ As you can see, the ordinal position increments based on the location of the con
329358
Also, if header levels are skipped in the content, the structure of the resulting document reflects the headers that are present in the Markdown content and doesn't necessarily contain nested sections for `h1` through `h6` consecutively. For example, when the document begins at `h2`, the first element in the top-level sections array is `h2`.
330359

331360
```http
332-
POST https://[service name].search.windows.net/indexers?api-version=2024-07-01
361+
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
333362
Content-Type: application/json
334363
api-key: [admin key]
335364
@@ -345,7 +374,9 @@ api-key: [admin key]
345374
}
346375
}
347376
```
348-
## Map Markdown one-to-one fields to search fields
377+
378+
## Map one-to-one fields to search fields
379+
349380
If you would like to extract fields with custom names from the document, you can use field mappings to do so. Using the same Markdown sample as before, consider the following index configuration:
350381

351382
```http
@@ -372,7 +403,7 @@ If you would like to extract fields with custom names from the document, you can
372403
}
373404
```
374405

375-
Extracting specific fields from the parsed Markdown is handled similar to how the document paths are in (outputFieldMappings)[https://learn.microsoft.com/en-us/azure/search/cognitive-search-output-field-mapping?tabs=rest], except the path begins with `/sections` instead of `/document`. So, for example, `/sections/0/content` would map to the content under the item at position 0 in the sections array.
406+
Extracting specific fields from the parsed Markdown works similarly to document paths in [outputFieldMappings](cognitive-search-output-field-mapping.md), except that the path begins with `/sections` instead of `/document`. For example, `/sections/0/content` maps to the content under the item at position 0 in the sections array.
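
As a rough mental model for how such a path selects a value, the following Python sketch walks a parsed document one path segment at a time. The helper `resolve_path` and the `parsed` dictionary are hypothetical and exist only for this illustration. The indexer resolves paths internally, and its exact rules might differ.

```python
def resolve_path(document: dict, path: str):
    """Follow a slash-delimited path through nested dicts and lists."""
    node = document
    for part in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(part)]   # numeric segment: position in an array
        else:
            node = node[part]        # named segment: property of an object
    return node


# Hand-written stand-in for a one-to-one parsed document.
parsed = {
    "sections": [
        {
            "header_name": "Section 1",
            "content": "Content for section 1.",
            "sections": [
                {"header_name": "Subsection 1.1",
                 "content": "Content for subsection 1.1.",
                 "sections": []},
            ],
        },
        {"header_name": "Section 2",
         "content": "Content for section 2.",
         "sections": []},
    ]
}

print(resolve_path(parsed, "/sections/0/content"))
print(resolve_path(parsed, "/sections/1/header_name"))
```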
376407

377408
An example of a strong use case might look something like this: all Markdown files have a document title in the first `h1`, a subsection title in the first `h2`, and a summary in the content of the final paragraph underneath the final `h1`. You could use the following field mappings to index only that content:
378409
