Commit 3a16299

Merge pull request #1491 from HeidiSteen/heidist-ignite
[release-azure-search] Template pass over markdown parsing how-to
2 parents 3c5640e + da36b21 commit 3a16299

1 file changed

+82
-41
lines changed

articles/search/search-how-to-index-markdown-blobs.md

Lines changed: 82 additions & 41 deletions
@@ -17,25 +17,54 @@ ms.date: 11/19/2024
1717

1818
[!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]
1919

20-
**Applies to**: [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), [File indexers](search-file-storage-integration.md)
20+
In Azure AI Search, indexers for Azure Blob Storage, Azure Files, and OneLake support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
2121

22-
In Azure AI Search, indexers for Azure Blob Storage and Azure Files support a `markdown` parsing mode for Markdown files. Markdown files can be indexed in two ways:
22+
+ One-to-many parsing mode, creating multiple search documents per Markdown file
23+
+ One-to-one parsing mode, creating one search document per Markdown file
2324

24-
+ One-to-many parsing mode
25-
+ One-to-one parsing mode
25+
## Prerequisites
26+
27+
+ A supported data source: Azure Blob storage, Azure File storage, OneLake in Microsoft Fabric.
28+
29+
For OneLake, make sure you meet all of the requirements of the [OneLake indexer](search-how-to-index-onelake-files.md#prerequisites).
30+
31+
Azure Storage for [blob indexers](search-howto-indexing-azure-blob-storage.md#prerequisites) and [file indexers](search-file-storage-integration.md#prerequisites) is a standard performance (general-purpose v2) instance that supports hot, cool, and cold access tiers.
32+
33+
## Markdown parsing mode parameters
34+
35+
Parsing mode parameters are specified in an indexer definition when you create or update an indexer.
36+
37+
```http
38+
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
39+
Content-Type: application/json
40+
api-key: [admin key]
41+
42+
{
43+
"name": "my-markdown-indexer",
44+
"dataSourceName": "my-blob-datasource",
45+
"targetIndexName": "my-target-index",
46+
"parameters": {
47+
"configuration": {
48+
"parsingMode": "markdown",
49+
"markdownParsingSubmode": "oneToMany",
50+
"markdownHeaderDepth": "h6"
51+
}
52+
}
53+
}
54+
```
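As a rough illustration, the same request body can be assembled programmatically, which avoids hand-editing mistakes such as trailing commas. The following Python sketch only builds and serializes the JSON; it doesn't call the service. The helper name `build_markdown_indexer` and its validation rules (including the assumed `h1` through `h6` range) are hypothetical. Only the property names and values shown in the request above come from this article.

```python
import json

# Submode values named in this article.
VALID_SUBMODES = {"oneToMany", "oneToOne"}
# Assumed range of header depths: h1 through h6.
VALID_HEADER_DEPTHS = {f"h{level}" for level in range(1, 7)}


def build_markdown_indexer(name, data_source_name, target_index_name,
                           submode="oneToMany", header_depth="h6"):
    """Return an indexer definition (as a dict) that uses markdown parsing mode."""
    if submode not in VALID_SUBMODES:
        raise ValueError(f"Unsupported submode: {submode!r}")
    if header_depth not in VALID_HEADER_DEPTHS:
        raise ValueError(f"Unsupported header depth: {header_depth!r}")
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
        "parameters": {
            "configuration": {
                "parsingMode": "markdown",
                "markdownParsingSubmode": submode,
                "markdownHeaderDepth": header_depth,
            }
        },
    }


definition = build_markdown_indexer(
    "my-markdown-indexer", "my-blob-datasource", "my-target-index"
)
# json.dumps always emits syntactically valid JSON.
request_body = json.dumps(definition, indent=2)
print(request_body)
```

Because parameter names are case-sensitive, keeping them in one helper like this can reduce the chance of a misspelled property.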
2655

2756
The blob indexer provides a `submode` parameter (set as `markdownParsingSubmode` in the indexer configuration) to determine the structure of the search documents it outputs. Markdown parsing mode provides the following submode options:
2857

2958
| parsingMode | submode | Search document | Description |
3059
|--------------|-------------|-------------|--------------|
31-
| **`markdown`** | **`oneToMany`** | Multiple per blob | (default) Breaks the Markdown into multiple search documents, each representing a content (nonheader) section of the Markdown file. |
60+
| **`markdown`** | **`oneToMany`** | Multiple per blob | (default) Breaks the Markdown into multiple search documents, each representing a content (nonheader) section of the Markdown file. You can omit submode unless you want one-to-one parsing.|
3261
| **`markdown`** | **`oneToOne`** | One per blob | Parses the Markdown into one search document, with sections mapped to specific headers in the Markdown file.|
3362

3463
For **`oneToMany`** submode, you should review [Indexing one blob to produce many search documents](search-howto-index-one-to-many-blobs.md) to understand how the blob indexer handles disambiguation of the document key for multiple search documents produced from the same blob.
3564

3665
Later sections describe each submode in more detail. If you're unfamiliar with indexer clients and concepts, see [Create a search indexer](search-howto-create-indexers.md). You should also be familiar with the details of [basic blob indexer configuration](search-howto-indexing-azure-blob-storage.md), which isn't repeated here.
3766

38-
## Additional Markdown parsing parameters
67+
### Optional Markdown parsing parameters
3968

4069
Parameters are case-sensitive.
4170

@@ -46,13 +75,14 @@ Parameters are case-sensitive.
4675
This setting can be changed after the indexer is created. However, the structure of the resulting search documents might change depending on the Markdown content.
4776

4877
## Supported Markdown elements
49-
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, etc., are treated as plaintext.
78+
79+
Markdown parsing will only split content based on headers. All other elements such as lists, code blocks, tables, and so forth, are treated as plain text and passed into a content field.
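To make the header-only splitting concrete, here's a minimal Python sketch of the idea. It's an illustration under stated assumptions, not the indexer's actual implementation: the function name `split_on_headers`, the `max_depth` argument (loosely modeled on `markdownHeaderDepth`), and the sample text are all hypothetical. The sketch also treats any line that starts with `#` as a header, so a `#` comment inside a fenced code block would be split incorrectly.

```python
import re

# Headers of the form "# Title" through "###### Title".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_on_headers(markdown, max_depth=6):
    """Split Markdown into sections at headers no deeper than max_depth."""
    sections = []
    current = {"header_level": None, "header_name": None, "lines": []}
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_depth:
            # A qualifying header closes the current section and opens a new one.
            sections.append(current)
            current = {
                "header_level": f"h{len(match.group(1))}",
                "header_name": match.group(2).strip(),
                "lines": [],
            }
        else:
            # Lists, tables, and deeper headers are kept as plain text.
            current["lines"].append(line)
    sections.append(current)
    return [
        {
            "header_level": s["header_level"],
            "header_name": s["header_name"],
            "content": "\n".join(s["lines"]).strip(),
        }
        for s in sections
        # Drop an empty preamble that precedes the first header.
        if s["header_name"] is not None or any(l.strip() for l in s["lines"])
    ]


sample = (
    "# Title\n"
    "Intro text.\n"
    "\n"
    "- item one\n"
    "- item two\n"
    "\n"
    "## Details\n"
    "| a | b |\n"
    "|---|---|\n"
    "| 1 | 2 |\n"
)

for section in split_on_headers(sample):
    print(section["header_level"], section["header_name"])
    print(section["content"])
```

In this sketch the list and the table end up inside the `content` string of their enclosing section, unchanged.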
5080

5181
<a name="parsing-markdown-one-to-many"></a>
5282

5383
## Sample Markdown content
5484

55-
The following Markdown content will be used for the examples on this page:
85+
The following Markdown content is used for the examples on this page:
5686

5787
```md
5888
# Section 1
@@ -65,9 +95,9 @@ Content for subsection 1.1.
6595
Content for section 2.
6696
```
6797

68-
## Markdown one-to-many parsing mode (Markdown to Multiple Documents)
98+
## Use one-to-many parsing mode
6999

70-
The **Markdown one-to-many parsing mode** parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into documents which contain the following content:
100+
The one-to-many parsing mode parses Markdown files into multiple search documents, where each document corresponds to a specific content section of the Markdown file based on the header metadata at that point in the document. The Markdown is parsed based on headers into search documents which contain the following content:
71101

72102
- `content`: A string that contains the raw Markdown found in a specific location, based on the header metadata at that point in the document.
73103

@@ -81,8 +111,7 @@ The **Markdown one-to-many parsing mode** parses Markdown files into multiple se
81111

82112
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each header.
83113

84-
85-
### Index schema for one-to-many parsed Markdown files
114+
### Index schema for one-to-many parsing
86115

87116
An example index configuration might look something like this:
88117
```http
@@ -118,7 +147,10 @@ An example index configuration might look something like this:
118147
}
119148
```
120149

121-
The blob indexer can infer the mapping without a field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
150+
### Indexer definition for one-to-many parsing
151+
152+
If field names and data types align, the blob indexer can infer the mapping without an explicit field mapping present in the request, so an indexer configuration corresponding to the provided index configuration might look like this:
153+
122154
```http
123155
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
124156
Content-Type: application/json
@@ -137,7 +169,9 @@ api-key: [admin key]
137169
> [!NOTE]
138170
> The `submode` does not need to be set explicitly here because `oneToMany` is the default.
139171
140-
This Markdown file would result in three search documents after indexing, due to the three content sections. The search document resulting from the first content section of the provided Markdown document would contain the following values for `content`, `sections`, `h1`, and `h2`:
172+
### Indexer output for one-to-many parsing
173+
174+
This Markdown file would result in three search documents after indexing, due to the three content sections. The search document resulting from the first content section of the provided Markdown document would contain the following values for `content`, `sections`, `h1`, and `h2`:
141175

142176
```http
143177
{
@@ -168,7 +202,7 @@ api-key: [admin key]
168202
}
169203
```
170204

171-
## Map Markdown one-to-many fields to search fields
205+
### Map one-to-many fields in a search index
172206

173207
Field mappings associate a source field with a destination field in situations where the field names and types aren't identical. But field mappings can also be used to match parts of a Markdown document and "lift" them into top-level fields of the search document.
174208

@@ -207,13 +241,13 @@ The resulting search document in the index would look as follows:
207241

208242
<a name="parsing-markdown-one-to-one"></a>
209243

210-
## Markdown one-to-one parsing mode (Markdown to a single document)
244+
## Use one-to-one parsing mode
211245

212-
In **Markdown one-to-one parsing mode**, the entire Markdown document is indexed as a single search document, preserving the hierarchy and structure of the original content. This mode is most useful when the files to be indexed share a common structure, so that you can leverage this common structure in the index to make the relevant fields searchable.
246+
In the one-to-one parsing mode, the entire Markdown document is indexed as a single search document, preserving the hierarchy and structure of the original content. This mode is most useful when the files to be indexed share a common structure, so that you can use this common structure in the index to make the relevant fields searchable.
213247

214-
Within the indexer definition, set the `parsingMode` to "Markdown" and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
248+
Within the indexer definition, set the `parsingMode` to `"markdown"` and use the optional `markdownHeaderDepth` parameter to define the maximum heading depth for chunking. If not specified, it defaults to `h6`, capturing all possible header depths.
215249

216-
The Markdown is parsed based on headers into documents which contain the following content:
250+
The Markdown is parsed based on headers into search documents which contain the following content:
217251

218252
- `document_content`: Contains the full Markdown text as a single string. This field serves as a raw representation of the input document.
219253

@@ -223,13 +257,13 @@ The Markdown is parsed based on headers into documents which contain the followi
223257

224258
- `header_name`: A string containing the text of the header as it appears in the Markdown document. This field provides a label or title for the section.
225259

226-
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there is no content directly under a header, this is an empty string.
260+
- `content`: A string containing text content that immediately follows the header, up to the next header. This field captures the detailed information or description associated with the header. If there's no content directly under a header, this is an empty string.
227261

228262
- `ordinal_position`: An integer value indicating the position of the section within the document hierarchy. This field is used for ordering the sections in their original sequence as they appear in the document, beginning with an ordinal position of 1 and incrementing sequentially for each content block.
229263

230264
- `sections`: An array that contains objects representing subsections nested under the current section. This array follows the same structure as the top-level `sections` array, allowing for the representation of multiple levels of nested content. Each subsection object also includes `header_level`, `header_name`, `content`, and `ordinal_position` properties, enabling a recursive structure that represents the hierarchy of the Markdown content.
231265

232-
Consider the following Markdown content. We use this content to explain an index schema that's designed around it, and what the search documents might look like for each parsing mode.
266+
Here's the sample Markdown that we're using to explain an index schema that's designed around each parsing mode.
233267

234268
```md
235269
# Section 1
@@ -242,8 +276,9 @@ Content for subsection 1.1.
242276
Content for section 2.
243277
```
244278

245-
### Index schema for one-to-one parsed Markdown files
246-
If you are not utilizing field mappings, the shape of the index should reflect the shape of the Markdown content. Based on the previous Markdown, the index should look similar to the following example:
279+
### Index schema for one-to-one parsing
280+
281+
If you aren't using field mappings, the shape of the index should reflect the shape of the Markdown content. Given the structure of the sample Markdown, with its two sections and single subsection, the index should look similar to the following example:
247282
```http
248283
{
249284
"name": "my-markdown-index",
@@ -296,6 +331,28 @@ If you are not utilizing field mappings, the shape of the index should reflect t
296331
}
297332
```
298333

334+
### Indexer definition for one-to-one parsing
335+
336+
```http
337+
POST https://[service name].search.windows.net/indexers?api-version=2024-11-01-preview
338+
Content-Type: application/json
339+
api-key: [admin key]
340+
341+
{
342+
"name": "my-markdown-indexer",
343+
"dataSourceName": "my-blob-datasource",
344+
"targetIndexName": "my-target-index",
345+
"parameters": {
346+
"configuration": {
347+
"parsingMode": "markdown",
348+
"markdownParsingSubmode": "oneToOne"
349+
}
350+
}
351+
}
352+
```
353+
354+
### Indexer output for one-to-one parsing
355+
299356
Because the Markdown we want to index only goes to a depth of `h2` ("##"), we need `sections` fields nested to a depth of 2 to match that. This configuration would result in the following data in the index:
300357

301358
```http
@@ -328,24 +385,8 @@ As you can see, the ordinal position increments based on the location of the con
328385

329386
Note that if header levels are skipped in the content, the structure of the resulting document reflects the headers that are actually present in the Markdown content. It doesn't necessarily contain nested sections for `h1` through `h6` consecutively. For example, when the document begins at `h2`, the first element in the top-level sections array is `h2`.
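The nesting behavior for skipped header levels can be sketched with a small stack-based tree builder. This Python example is an approximation for illustration only: the function name `build_section_tree` is hypothetical, the `"h2"`-style values for `header_level` are assumed, the ordinal position is incremented once per header (an assumption about how content blocks are counted), and text that precedes the first header is ignored.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def build_section_tree(markdown):
    """Nest sections under the closest preceding header of a shallower level."""
    root = {"sections": []}
    stack = [(0, root)]  # (header depth, node); depth 0 is the document root
    ordinal = 0
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            depth = len(match.group(1))
            ordinal += 1
            node = {
                "header_level": f"h{depth}",
                "header_name": match.group(2).strip(),
                "content": "",
                "ordinal_position": ordinal,
                "sections": [],
            }
            # Pop siblings and deeper sections until a shallower parent is on top.
            while stack[-1][0] >= depth:
                stack.pop()
            stack[-1][1]["sections"].append(node)
            stack.append((depth, node))
        elif stack[-1][0] > 0 and line.strip():
            # Non-header text belongs to the most recent header.
            current = stack[-1][1]
            current["content"] = (current["content"] + "\n" + line).strip()
    return root


# Hypothetical sample that starts at h2 and never uses h1.
sample = "## Overview\nText.\n### Detail\nMore.\n## Next\nEnd."
tree = build_section_tree(sample)
print(tree["sections"][0]["header_level"])
print(tree["sections"][0]["sections"][0]["header_name"])
```

Because no `h1` exists in this sample, the `h2` sections sit directly in the top-level `sections` array rather than under an empty `h1` placeholder.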
330387

331-
```http
332-
POST https://[service name].search.windows.net/indexers?api-version=2024-07-01
333-
Content-Type: application/json
334-
api-key: [admin key]
388+
### Map one-to-one fields in a search index
335389

336-
{
337-
"name": "my-markdown-indexer",
338-
"dataSourceName": "my-blob-datasource",
339-
"targetIndexName": "my-target-index",
340-
"parameters": {
341-
"configuration": {
342-
"parsingMode": "markdown",
343-
"markdownParsingSubmode": "oneToMany",
344-
}
345-
}
346-
}
347-
```
348-
## Map Markdown one-to-one fields to search fields
349390
If you would like to extract fields with custom names from the document, you can use field mappings to do so. Using the same Markdown sample as before, consider the following index configuration:
350391

351392
```http
@@ -372,7 +413,7 @@ If you would like to extract fields with custom names from the document, you can
372413
}
373414
```
374415

375-
Extracting specific fields from the parsed Markdown is handled similar to how the document paths are in (outputFieldMappings)[https://learn.microsoft.com/en-us/azure/search/cognitive-search-output-field-mapping?tabs=rest], except the path begins with `/sections` instead of `/document`. So, for example, `/sections/0/content` would map to the content under the item at position 0 in the sections array.
416+
Extracting specific fields from the parsed Markdown is handled similarly to document paths in [outputFieldMappings](cognitive-search-output-field-mapping.md), except the path begins with `/sections` instead of `/document`. So, for example, `/sections/0/content` would map to the content under the item at position 0 in the sections array.
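One way to reason about these paths is to walk them by hand against a parsed document. The Python sketch below is only a mental model, not how the indexer resolves mappings internally: the function name `resolve_section_path` and the `parsed` dictionary are hypothetical, and the `"h1"`-style values for `header_level` are assumed.

```python
def resolve_section_path(document, path):
    """Follow a '/sections/...' path: numeric tokens index arrays, others are keys."""
    node = document
    for token in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(token)]
        else:
            node = node[token]
    return node


# Hypothetical one-to-one parse result for a file with two h1 sections
# and one h2 subsection.
parsed = {
    "sections": [
        {
            "header_level": "h1",
            "header_name": "Section 1",
            "content": "Content for section 1.",
            "ordinal_position": 1,
            "sections": [
                {
                    "header_level": "h2",
                    "header_name": "Subsection 1.1",
                    "content": "Content for subsection 1.1.",
                    "ordinal_position": 2,
                    "sections": [],
                }
            ],
        },
        {
            "header_level": "h1",
            "header_name": "Section 2",
            "content": "Content for section 2.",
            "ordinal_position": 3,
            "sections": [],
        },
    ]
}

print(resolve_section_path(parsed, "/sections/0/content"))
print(resolve_section_path(parsed, "/sections/0/sections/0/header_name"))
```

A path to a nested subsection repeats the `sections/<index>` pair once per level, as in `/sections/0/sections/0/header_name`.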
376417

377418
An example of a strong use case might look something like this: all Markdown files have a document title in the first `h1`, a subsection title in the first `h2`, and a summary in the content of the final paragraph underneath the final `h1`. You could use the following field mappings to index only that content:
378419
