The **Document Layout** skill analyzes a document to extract regions of interest and their inter-relationships to produce a syntactical representation of the document in Markdown format. This skill uses the [Document Intelligence layout model](/azure/ai-services/document-intelligence/concept-layout) provided in [Azure AI Document Intelligence](/azure/ai-services/document-intelligence/overview).
This article is the reference documentation for the Document Layout skill.
+ For PDF and TIFF, up to 2,000 pages can be processed (with a free tier subscription, only the first two pages are processed).
+ Even though the file size limit for analyzing documents is 500 MB for the [Azure AI Document Intelligence paid (S0) tier](https://azure.microsoft.com/pricing/details/cognitive-services/) and 4 MB for the [Azure AI Document Intelligence free (F0) tier](https://azure.microsoft.com/pricing/details/cognitive-services/), indexing is subject to the [indexer limits](search-limits-quotas-capacity.md#indexer-limits) of your search service tier.
+ Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.
+ If your PDFs are password-locked, remove the lock before running the indexer.

|`outputMode`|`oneToMany`| Controls the cardinality of the output produced by the skill. |
|`markdownHeaderDepth`|`h1`, `h2`, `h3`, `h4`, `h5`, `h6` (default)| This parameter describes the deepest heading level that's treated as a section boundary. For instance, if `markdownHeaderDepth` is set to `h3`, any Markdown section that's deeper than h3 (that is, #### and deeper) is considered as "content" that's added to whatever level its parent is at. |
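For instance, with `markdownHeaderDepth` set to `h3`, a `####` heading doesn't produce its own section. Its text is folded into the `content` of the enclosing h3 section, as in this illustrative sketch of a single section object (values are hypothetical):

```json
{
  "content": "#### Advanced options\r\nText under the h4 heading is carried in the h3 section's content.",
  "sections": {
    "h1": "User guide",
    "h2": "Settings",
    "h3": "Options"
  }
}
```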
## Skill inputs
For example, a file reference object that points to the file by URL can be defined as follows (the URL and SAS token are placeholder values):

```json
{
  "$type": "file",
  "url": "https://example.blob.core.windows.net/mycontainer/myfile.pdf",
  "sastoken": "?sv=..."
}
```

The file reference object can be generated in one of the following ways:
+ Setting the `allowSkillsetToReadFileData` parameter on your indexer definition to `true`. This setting creates a path `/document/file_data` that's an object representing the original file data downloaded from your blob data source. This parameter only applies to files in Azure Blob storage.
+ Having a custom skill return a JSON object that provides `$type` and either `data` or `url` with an optional `sastoken`, as in the sketch that follows this list. The `$type` parameter must be set to `file`, and `data` must be the Base64-encoded byte array of the file content. The `url` parameter must be a valid URL with access for downloading the file at that location.
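For example, a custom skill Web API response that passes file content by value might look like the following sketch (the record ID and Base64 payload are placeholder values):

```json
{
  "values": [
    {
      "recordId": "1",
      "data": {
        "file_data": {
          "$type": "file",
          "data": "aGVsbG8gd29ybGQ="
        }
      }
    }
  ]
}
```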
## Skill outputs
| Output name | Description |
|---------------|-------------------------------|
|`markdown_document`| A collection of "sections" objects, which represent each individual section in the Markdown document.|
## Sample definition
A definition for this skill might look like the following (a reconstructed sketch; the `markdownHeaderDepth` value and output `targetName` are illustrative):

```json
{
  "skills": [
    {
      "description": "Analyze a document",
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "context": "/document",
      "outputMode": "oneToMany",
      "markdownHeaderDepth": "h3",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "markdown_document",
          "targetName": "markdown_document"
        }
      ]
    }
  ]
}
```
## Sample output
The following sketch shows the shape of the output for a source document with h1, h2, and h3 headings (the content values are illustrative):

```json
{
  "markdown_document": [
    {
      "content": "Hi this is Jim \r\nHi this is Joe",
      "sections": {
        "h1": "Foo",
        "h2": "Bar",
        "h3": ""
      }
    },
    {
      "content": "Hi this is Lance",
      "sections": {
        "h1": "Foo",
        "h2": "Bar",
        "h3": "Boo"
      }
    }
  ]
}
```

The value of `markdownHeaderDepth` controls the number of keys in the "sections" dictionary. Because the example skill definition specifies `markdownHeaderDepth` as "h3", there are three keys in the "sections" dictionary: h1, h2, and h3.

articles/search/search-how-to-semantic-chunking.md

---
title: Structure-aware chunking and vectorization
titleSuffix: Azure AI Search
description: Chunk text content by paragraph or semantically coherent fragment. You can then apply integrated vectorization to generate embeddings and send the results to a searchable index.
author: rawan
ms.author: rawan
ms.service: azure-ai-search
ms.topic: how-to
ms.date: 11/19/2024
ms.custom:
  - references_regions
---

# Structure-aware chunking and vectorization in Azure AI Search

Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new Document Layout skill that's currently in preview, you can chunk content based on paragraphs or semantically coherent fragments of sentence representations. These fragments can then be processed independently and recombined as semantic representations without loss of information, interpretation, or semantic relevance. The inherent meaning of the text is used as a guide for the chunking process.

The Document Layout skill uses Markdown syntax (headings and content) to articulate document structure in the search document. The searchable content obtained from your source document is plain text, but you can add integrated vectorization to generate embeddings for any field.

In this article, learn how to:

> [!div class="checklist"]
> + Use the Document Layout skill to detect sections and output Markdown content
> + Use the Text Split skill to constrain chunk size to each Markdown section
> + Generate embeddings for each chunk
> + Use index projections to map embeddings to fields in a search index

## Prerequisites

+ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output.
+ [A supported data source](search-indexer-overview.md#supported-data-sources) with text content that you want to chunk.
+ [A skillset with the Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) that splits documents based on paragraph boundaries.
+ [An Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) that generates vector embeddings.
+ [An index projection](search-how-to-define-index-projections.md) for one-to-many indexing.
## Prepare data files
The raw inputs must be in a [supported data source](search-indexer-overview.md#supported-data-sources), and the files must be in a format that the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) supports.

+ Supported indexers include any indexer that can handle the supported file formats, such as [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), and [File indexers](search-file-storage-integration.md).
+ Supported regions for this feature include East US, West US 2, West Europe, and North Central US. Be sure to [check this list](search-region-support.md#azure-public-regions) for updates on regional availability.
You can use the Azure portal, REST APIs, or an Azure SDK package to [create a data source](search-howto-indexing-azure-blob-storage.md).
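For example, a data source definition for Azure Blob Storage might look like the following sketch (the data source name, connection string, and container name are placeholders):

```http
POST {endpoint}/datasources?api-version=2024-11-01-preview
{
  "name": "my-blob-datasource",
  "type": "azureblob",
  "credentials": {
    "connectionString": "<your storage connection string>"
  },
  "container": {
    "name": "<your container name>"
  }
}
```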
> [!TIP]
> Upload the [health plan PDF](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/health-plan) sample files to your supported data source to try out the Document Layout skill and structure-aware chunking on your own search service. The [Import and vectorize data](search-get-started-portal-import-vectors.md) wizard is an easy code-free approach for trying out this skill. Be sure to select the **default parsing mode** to use structure-aware chunking. Otherwise, the [Markdown parsing mode](search-how-to-index-markdown-blobs.md) is used instead.
## Create an index for one-to-many indexing
Here's an example payload of a single search document designed around chunks. In this example, the parent field is `text_parent_id`. The child fields are the vector and nonvector chunks of the Markdown section.
You can use the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md).
An index must exist on the search service before you create the skill set or run the indexer.
The following is a condensed sketch of such an index (field names like `text_parent_id`, `chunk`, and `text_vector` are illustrative, and the `vectorSearch` configuration must match your embedding model's dimensions):

```json
{
  "name": "my_consolidated_index",
  "fields": [
    { "name": "chunk_id", "type": "Edm.String", "key": true, "searchable": true, "analyzer": "keyword" },
    { "name": "text_parent_id", "type": "Edm.String", "filterable": true },
    { "name": "chunk", "type": "Edm.String", "searchable": true },
    { "name": "header_1", "type": "Edm.String", "searchable": true },
    { "name": "header_2", "type": "Edm.String", "searchable": true },
    { "name": "header_3", "type": "Edm.String", "searchable": true },
    { "name": "text_vector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "my-profile" }
  ],
  "vectorSearch": {
    "profiles": [ { "name": "my-profile", "algorithm": "my-hnsw" } ],
    "algorithms": [ { "name": "my-hnsw", "kind": "hnsw" } ]
  }
}
```

## Define skill set for structure-aware chunking and vectorization
Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step.
Here's an example skill set definition payload that projects individual Markdown section chunks and their vector outputs as documents in the search index, using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) and the [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md). The skills array and index projection shown are a condensed sketch, and the Azure OpenAI resource URI, deployment, and model names are placeholders:
```http
POST {endpoint}/skillsets?api-version=2024-11-01-preview
{
  "name": "my_skillset",
  "description": "A skillset for structure-aware chunking and vectorization with an index projection around markdown section",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "context": "/document",
      "outputMode": "oneToMany",
      "markdownHeaderDepth": "h3",
      "inputs": [
        { "name": "file_data", "source": "/document/file_data" }
      ],
      "outputs": [
        { "name": "markdown_document", "targetName": "markdownDocument" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "context": "/document/markdownDocument/*",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "inputs": [
        { "name": "text", "source": "/document/markdownDocument/*/content" }
      ],
      "outputs": [
        { "name": "textItems", "targetName": "pages" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "context": "/document/markdownDocument/*/pages/*",
      "resourceUri": "https://<your-azure-openai-resource>.openai.azure.com",
      "deploymentId": "<your-embedding-deployment>",
      "modelName": "text-embedding-ada-002",
      "inputs": [
        { "name": "text", "source": "/document/markdownDocument/*/pages/*" }
      ],
      "outputs": [
        { "name": "embedding", "targetName": "text_vector" }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "my_consolidated_index",
        "parentKeyFieldName": "text_parent_id",
        "sourceContext": "/document/markdownDocument/*/pages/*",
        "mappings": [
          { "name": "chunk", "source": "/document/markdownDocument/*/pages/*" },
          { "name": "header_1", "source": "/document/markdownDocument/*/sections/h1" },
          { "name": "header_2", "source": "/document/markdownDocument/*/sections/h2" },
          { "name": "header_3", "source": "/document/markdownDocument/*/sections/h3" },
          { "name": "text_vector", "source": "/document/markdownDocument/*/pages/*/text_vector" }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
```
## Run the indexer
Once you create a data source, index, and skillset, you're ready to [create and run the indexer](search-howto-create-indexers.md#run-the-indexer). This step puts the pipeline into execution.
When using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md), make sure to set the following parameters on the indexer definition:
+ The `allowSkillsetToReadFileData` parameter should be set to `true`.
+ The `parsingMode` parameter should be set to `default`.
Here's an example payload (a minimal sketch; the data source, index, and skillset names carry over from the earlier examples):

```json
{
  "name": "my_indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my_consolidated_index",
  "skillsetName": "my_skillset",
  "parameters": {
    "configuration": {
      "parsingMode": "default",
      "allowSkillsetToReadFileData": true
    }
  }
}
```

## Verify results
You can query your search index after processing concludes to test your solution.
To check the results, run a query against the index. Use [Search Explorer](search-explorer.md) as a search client, or any tool that sends HTTP requests. The following query selects fields that contain the nonvector content of the Markdown section and its vector; the field names match the index sketch shown earlier.
For Search Explorer, you can copy just the JSON and paste it into the JSON view for query execution.
```http
POST /indexes/[index name]/docs/search?api-version=[api-version]
{
  "search": "*",
  "select": "header_1, header_2, header_3, chunk, text_vector",
  "count": true
}
```

## See also

+ [Create or update a skill set](cognitive-search-defining-skillset.md)
+ [Create a data source](search-howto-indexing-azure-blob-storage.md)
+ [Define an index projection](search-how-to-define-index-projections.md)