Commit 468475b

Merge pull request #1365 from HeidiSteen/heidist-ignite
[release-azure-search] Doc layout skill edit pass for standardization and consistency
2 parents 6b26548 + b61b743 commit 468475b

3 files changed: +62 -49 lines

articles/search/cognitive-search-skill-document-intelligence-layout.md

Lines changed: 10 additions & 12 deletions

@@ -17,7 +17,7 @@ ms.date: 11/19/2024
 
 [!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]
 
-The **Document Layout** skill analyzes a document to extract regions of interest and their inter-relationships to produce a syntactical representation (markdown format). This skill uses the [Document Intelligence layout model](/azure/ai-services/document-intelligence/concept-layout) provided in [Azure AI Document Intelligence](/azure/ai-services/document-intelligence/overview).
+The **Document Layout** skill analyzes a document to extract regions of interest and their inter-relationships to produce a syntactical representation of the document in Markdown format. This skill uses the [Document Intelligence layout model](/azure/ai-services/document-intelligence/concept-layout) provided in [Azure AI Document Intelligence](/azure/ai-services/document-intelligence/overview).
 
 This article is the reference documentation for the Document Layout skill.
 
@@ -52,9 +52,9 @@ Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill
 ## Data limits
 
 + For PDF and TIFF, up to 2,000 pages can be processed (with a free tier subscription, only the first two pages are processed).
-+ Even if the file size for analyzing documents is 500 MB for [Azure AI Document Intelligence paid (S0) tier](https://azure.microsoft.com/pricing/details/cognitive-services/) and 4 MB for [Azure AI Document Intelligence free (F0) tier](https://azure.microsoft.com/pricing/details/cognitive-services/), to use this functionality you need to take into consideration the file limits allowed by your indexer based on the [indexer limits](search-limits-quotas-capacity.md#indexer-limits) associated to your Azure AI Search service tier.
-+ Image dimensions must be between 50 pixels x 50 pixels and 10,000 pixels x 10,000 pixels.
-+ If your PDFs are password-locked, you must remove the lock before submission.
++ Even if the file size for analyzing documents is 500 MB for [Azure AI Document Intelligence paid (S0) tier](https://azure.microsoft.com/pricing/details/cognitive-services/) and 4 MB for [Azure AI Document Intelligence free (F0) tier](https://azure.microsoft.com/pricing/details/cognitive-services/), indexing is subject to the [indexer limits](search-limits-quotas-capacity.md#indexer-limits) of your search service tier.
++ Image dimensions must be between 50 pixels x 50 pixels or 10,000 pixels x 10,000 pixels.
++ If your PDFs are password-locked, remove the lock before running the indexer.
 
 ## Skill parameters
 
@@ -63,7 +63,7 @@ Parameters are case-sensitive.
 | Parameter name | Allowed Values | Description |
 |--------------------|-------------|-------------|
 | `outputMode` | `oneToMany` | Controls the cardinality of the output produced by the skill. |
-| `markdownHeaderDepth` |`h1`, `h2`, `h3`, `h4`, `h5`, `h6(default)` | This parameter describes the deepest nesting level that should be considered. For instance, if the markdownHeaderDepth is indicated as “h3” any markdown section that’s deeper than h3 (that is, #### and deeper) is considered as "content" that needs to be added to whatever level its parent is at. |
+| `markdownHeaderDepth` |`h1`, `h2`, `h3`, `h4`, `h5`, `h6(default)` | This parameter describes the deepest nesting level that should be considered. For instance, if the markdownHeaderDepth is indicated as "h3" any markdown section that’s deeper than h3 (that is, #### and deeper) is considered as "content" that needs to be added to whatever level its parent is at. |
 
 ## Skill inputs
 
@@ -90,17 +90,17 @@ Alternatively, it can be defined as:
 }
 ```
 
-The file reference object can be generated one of following ways:
+The file reference object can be generated in one of following ways:
 
-+ Setting the `allowSkillsetToReadFileData` parameter on your indexer definition to "true." This setting creates a path `/document/file_data` that is an object representing the original file data downloaded from your blob data source. This parameter only applies to files in Blob storage.
++ Setting the `allowSkillsetToReadFileData` parameter on your indexer definition to true. This setting creates a path `/document/file_data` that's an object representing the original file data downloaded from your blob data source. This parameter only applies to files in Azure Blob storage.
 
-+ Having a custom skill return a json object defined EXACTLY as above. The `$type` parameter must be set to exactly `file` and the `data` parameter must be the base 64 encoded byte array data of the file content, or the `url` parameter must be a correctly formatted URL with access to download the file at that location.
++ Having a custom skill returning a JSON object that provides `$type`, `data`, or `url` and `sastoken`. The `$type` parameter must be set to `file`, and `data` must be the base 64-encoded byte array of the file content. The `url` parameter must be a valid URL with access for downloading the file at that location.
 
 ## Skill outputs
 
 | Output name | Description |
 |---------------|-------------------------------|
-| `markdown_document` | A collection of "sections" objects, which represent each individual section in the markdown document.|
+| `markdown_document` | A collection of "sections" objects, which represent each individual section in the Markdown document.|
 
 ## Sample definition
 
@@ -130,8 +130,6 @@ The file reference object can be generated one of following ways:
 }
 ```
 
-<a name="sample-output"></a>
-
 ## Sample output
 
 ```json
@@ -159,7 +157,7 @@ The file reference object can be generated one of following ways:
 }
 ```
 
-The value of the "deepestSection" parameter controls the number of keys in the 'sections' dictionary. In the example skill definition, since the deepestSection was specified as “h3”, there are three keys in the "sections" dictionary: h1, h2, h3.
+The value of the `markdownHeaderDepth` controls the number of keys in the "sections" dictionary. In the example skill definition, since the `markdownHeaderDepth` is "h3", there are three keys in the "sections" dictionary: h1, h2, h3.
 
 ## See also
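The `markdownHeaderDepth` behavior described in the diff above is easy to sanity-check in code. The helper below is a hypothetical sketch, not part of the Azure SDK or the committed docs: it derives the header keys you would expect to find in each "sections" object for a given depth setting.

```python
# Hypothetical sketch, not the Azure SDK: derive the keys expected in a
# "sections" object for a given markdownHeaderDepth value ("h1".."h6").
def section_keys(markdown_header_depth: str = "h6") -> list[str]:
    depth = int(markdown_header_depth[1:])  # "h3" -> 3
    if not 1 <= depth <= 6:
        raise ValueError("markdownHeaderDepth must be between h1 and h6")
    # Headers deeper than the configured depth are folded into their
    # parent section's content, so only h1..h<depth> appear as keys.
    return [f"h{level}" for level in range(1, depth + 1)]

print(section_keys("h3"))  # ['h1', 'h2', 'h3']
```

With the sample definition's value of "h3", the sections dictionary carries exactly three keys: h1, h2, and h3.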
articles/search/search-how-to-semantic-chunking.md

Lines changed: 45 additions & 30 deletions

@@ -1,54 +1,62 @@
 ---
-title: Semantic Chunking and vectorization
+title: Structure-aware chunking and vectorization
 titleSuffix: Azure AI Search
-description: Use skill set and index projection to do semantic chunking, vectorization, and write into a search index in Azure AI Search pipelines.
+description: Chunk text content by paragraph or semantically coherent fragment. You can then apply integrated vectorization to generate embeddings and send the results to a searchable index.
 author: rawan
 ms.author: rawan
 ms.service: azure-ai-search
 ms.topic: how-to
-ms.date: 10/12/2024
+ms.date: 11/19/2024
 ms.custom:
 - references_regions
 ---
 
-# Semantic chunking and vectorization using the Document Layout skill and index projections
+# Structure-aware chunking and vectorization in Azure AI Search
 
 [!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]
 
-Text data chunking strategies play a key role in optimizing the RAG response and performance. Semantic chunking is to find semantically coherent fragments of a sentence representation. These fragments can then be processed independently and recombined as semantic representations without loss of information, interpretation, or semantic relevance. The inherent meaning of the text is used as a guide for the chunking process. Markdown is a structured and formatted markup language and a popular input for enabling semantic chunking in RAG (Retrieval-Augmented Generation)
+Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new Document Layout skill that's currently in preview, you can chunk content based on paragraphs or semantically coherent fragments of a sentence representation. These fragments can then be processed independently and recombined as semantic representations without loss of information, interpretation, or semantic relevance. The inherent meaning of the text is used as a guide for the chunking process.
 
-The Document Layout skill offers a comprehensive solution for advanced content extraction and chunk functionality. With the Layout skill, you can easily extract document layout and content as markdown format and utilize markdown parsing mode to produce a set of document chunks
+The Document Layout skill uses Markdown syntax (headings and content) to articulate document structure in the search document. The searchable content obtained from your source document is plain text but you can add integrated vectorization to generate embeddings for any field.
 
-This article shows:
-+ How to use the Document Layout skill to extract markdown sections
-+ How to apply split skill to constrain chunk size within each markdown section
-+ Generate embeddings for the content within those sections
-+ How to use index projections to compile and write them into a search index.
+In this article, learn how to:
+
+> [!div class="checklist"]
+> + Use the Document Layout skill to detect sections and output Markdown content
+> + Use the Text Split skill to constrain chunk size to each markdown section
+> + Generate embeddings for each chunk
+> + Use index projections to map embeddings to fields in a search index
 
 ## Prerequisites
 
-+ An [indexer-based indexing pipeline](search-indexer-overview.md).
-+ An index that accepts the output of the indexer pipeline.
-+ A [supported data source](search-indexer-overview.md#supported-data-sources) having content that you want to chunk.
-+ A [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) that splits documents based on paragraph boundaries.
-+ An [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) that generates vector embeddings
-+ An [index projection](search-how-to-define-index-projections.md) for one-to-many indexing
++ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output.
++ [A supported data source](search-indexer-overview.md#supported-data-sources) having text content that you want to chunk.
++ [A skillset with Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) that splits documents based on paragraph boundaries.
++ [An Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) that generates vector embeddings.
++ [An index projection](search-how-to-define-index-projections.md) for one-to-many indexing.
 
 ## Prepare data files
 
 The raw inputs must be in a [supported data source](search-indexer-overview.md#supported-data-sources) and the file needs to be a format which [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) supports.
 
-+ Supported file format: PDF, JPEG, JPG, PNG, BMP, TIFF, DOCX, XLSX,PPTX,HTML
++ Supported file formats include: PDF, JPEG, JPG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.
+
++ Supported indexers can be any indexer that can handle the supported file formats. These include [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), [File indexers](search-file-storage-integration.md).
+
++ Supported regions for this feature include: East US, West US2, West Europe, North Central US. Be sure to [check this list](search-region-support.md#azure-public-regions) for updates on regional availability.
 
-You can use the Azure portal, REST APIs, or an Azure SDK to [create a data source](search-howto-indexing-azure-blob-storage.md).
+You can use the Azure portal, REST APIs, or an Azure SDK package to [create a data source](search-howto-indexing-azure-blob-storage.md).
+
+> [!TIP]
+> Upload the [health plan PDF](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/health-plan) sample files to your supported data source to try out the Document Layout skill and structure-aware chunking on your own search service. The [Import and vectorize data](search-get-started-portal-import-vectors.md) wizard is an easy code-free approach for trying out this skill. Be sure to select the **default parsing mode** to use structure-aware chunking. Otherwise, the [Markdown parsing mode](search-how-to-index-markdown-blobs.md) is used instead.
 
 ## Create an index for one-to-many indexing
 
-Here's an example payload of a single designed around chunks. In this example, parent fields are the text_parent_id. Child fields are the vector and nonvector chunk of the markdown section.
+Here's an example payload of a single search document designed around chunks. In this example, parent fields are the text_parent_id. Child fields are the vector and nonvector chunks of the markdown section.
 
 You can use the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md).
 
-An index must exist on the search service before you create the skill set or run the indexer
+An index must exist on the search service before you create the skill set or run the indexer.
 
 ```json
 {
@@ -164,16 +172,18 @@ An index must exist on the search service before you create the skill set or run
 }
 ```
 
-## Define skill set for semantic chunking and vectorization
+## Define skill set for structure-aware chunking and vectorization
 
-You can use the REST APIs to [create or update a skill set](cognitive-search-defining-skillset.md).
+Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step.
 
 Here's an example skill set definition payload to project individual markdown sections chunks and their vector outputs as documents in the search index using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) and [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md)
 
-```json
+```https
+POST {endpoint}/skillsets?api-version=2024-11-01-preview
+
 {
 "name": "my_skillset",
-"description": "A skillset for semantic chunking and vectorization with a indexprojection around markdown section",
+"description": "A skillset for structure-aware chunking and vectorization with a index projection around markdown section",
 "skills": [
 {
 "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
@@ -288,11 +298,13 @@ Here's an example skill set definition payload to project individual markdown se
 ```
 
 ## Run the indexer
-Once you create a data source, indexes, and skill set, you're ready to [create and run the indexer](search-howto-create-indexers.md#run-the-indexer). This step puts the pipeline into execution.
+
+Once you create a data source, index, and skillset, you're ready to [create and run the indexer](search-howto-create-indexers.md#run-the-indexer). This step puts the pipeline into execution.
 
 When using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md), make sure to set the following parameters on the indexer definition:
-+ The `allowSkillsetToReadFileData` parameter should be set to "true."
-+ the `parsingMode` parameter should be set to "default."
+
++ The `allowSkillsetToReadFileData` parameter should be set to `true`.
++ the `parsingMode` parameter should be set to `default`.
 
 Here's an example payload
 
@@ -321,10 +333,13 @@ Here's an example payload
 ```
 
 ## Verify results
+
 You can query your search index after processing concludes to test your solution.
 
 To check the results, run a query against the index. Use [Search Explorer](search-explorer.md) as a search client, or any tool that sends HTTP requests. The following query selects fields that contain the output of markdown section nonvector content and its vector.
 
+For Search Explorer, you can copy just the JSON and paste it into the JSON view for query execution.
+
 ```http
 POST /indexes/[index name]/docs/search?api-version=[api-version]
 {
@@ -335,11 +350,11 @@ POST /indexes/[index name]/docs/search?api-version=[api-version]
 
 ## See also
 
++ [Create or update a skill set](cognitive-search-defining-skillset.md).
 + [Create a data source](search-howto-indexing-azure-blob-storage.md)
 + [Define an index projection](search-how-to-define-index-projections.md)
-+ [How to define a skill set](cognitive-search-defining-skillset.md)
 + [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md)
++ [Text Split skill](cognitive-search-skill-textsplit.md)
 + [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md)
 + [Create indexer (REST)](/rest/api/searchservice/indexers/create)
 + [Search Explorer](search-explorer.md)
-
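To try the verification query above outside of Search Explorer, any HTTP client works. The sketch below only assembles the request body; it is a hypothetical illustration, and the field names (`text_parent_id`, `chunk`) are placeholder assumptions rather than values taken from this commit, so substitute the fields defined in your own index.

```python
import json

# Hypothetical sketch: assemble the request body for
# POST /indexes/[index name]/docs/search?api-version=[api-version]
# Field names are placeholders; use the fields defined in your index.
def build_verify_query(fields, top=5):
    return json.dumps(
        {
            "search": "*",                # match all documents
            "select": ", ".join(fields),  # nonvector fields to return
            "top": top,
            "count": True,
        },
        indent=2,
    )

print(build_verify_query(["text_parent_id", "chunk"]))
```

The printed JSON can be pasted directly into Search Explorer's JSON view or sent as the body of the POST request.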
articles/search/toc.yml

Lines changed: 7 additions & 7 deletions

@@ -323,7 +323,7 @@ items:
 href: cognitive-search-concept-image-scenarios.md
 - name: Cache (incremental) enrichment
 href: search-howto-incremental-index.md
-- name: Semantic chunking and vectorization
+- name: Structure-aware chunking and vectorization
 href: search-how-to-semantic-chunking.md
 - name: Design tips
 href: cognitive-search-concept-troubleshooting.md
@@ -642,7 +642,9 @@ items:
 - name: Annotation reference language
 href: cognitive-search-skill-annotation-language.md
 - name: Azure AI resource skills
-items:
+items:
+- name: Document Layout skill (preview)
+  href: cognitive-search-skill-document-intelligence-layout.md
 - name: Entity Linking (v3)
 href: cognitive-search-skill-entity-linking-v3.md
 - name: Entity Recognition (v3)
@@ -661,10 +663,8 @@ items:
 href: cognitive-search-skill-sentiment-v3.md
 - name: Text Translation
 href: cognitive-search-skill-text-translation.md
-- name: AI Vision multimodal embeddings
-  href: cognitive-search-skill-vision-vectorize.md
-- name: Document Intelligence Layout skill
-  href: cognitive-search-skill-document-intelligence-layout.md
+- name: AI Vision multimodal embeddings (preview)
+  href: cognitive-search-skill-vision-vectorize.md
 - name: Azure AI Search utility skills (nonbillable)
 items:
 - name: Conditional
@@ -679,7 +679,7 @@ items:
 href: cognitive-search-skill-textsplit.md
 - name: Azure OpenAI skills
 items:
-- name: Azure OpenAI Embedding
+- name: Azure OpenAI Embedding (preview)
 href: cognitive-search-skill-azure-openai-embedding.md
 - name: Custom skills
 items:
