articles/search/search-how-to-semantic-chunking.md (+61 −20 lines changed: 61 additions & 20 deletions)
@@ -1,35 +1,39 @@
 ---
-title: Structure-aware chunking and vectorization
+title: Chunk and vectorize by document layout
 titleSuffix: Azure AI Search
-description: Chunk text content by paragraph or semantically coherent fragment. You can then apply integrated vectorization to generate embeddings and send the results to a searchable index.
+description: Chunk textual content by headings and semantically coherent fragments, generate embeddings, and send the results to a searchable index.
 author: rawan
 ms.author: rawan
 ms.service: azure-ai-search
 ms.topic: how-to
-ms.date: 11/19/2024
+ms.date: 11/22/2024
 ms.custom:
   - references_regions
 ---

-# Structure-aware chunking and vectorization in Azure AI Search
+# Chunk and vectorize by document layout or structure

-Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new Document Layout skill that's currently in preview, you can chunk content based on paragraphs or semantically coherent fragments of a sentence representation. These fragments can then be processed independently and recombined as semantic representations without loss of information, interpretation, or semantic relevance. The inherent meaning of the text is used as a guide for the chunking process.
+Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new **Document Layout** skill that's currently in preview, you can chunk content based on document structure, capturing headings and chunking the content body based on semantic coherence, such as paragraphs and sentences. Chunks are processed independently. Because LLMs work with multiple chunks, when those chunks are of higher quality and semantically coherent, the overall relevance of the query is improved.

-The Document Layout skill uses Markdown syntax (headings and content) to articulate document structure in the search document. The searchable content obtained from your source document is plain text but you can add integrated vectorization to generate embeddings for any field.
+<!-- Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new **Document Layout** skill that's currently in preview, you can chunk content based on document structure, capturing headings and chunking the content body based on semantic coherence, such as paragraphs and sentences. Chunks are processed independently and recombined as semantic representations. The inherent meaning of the text is used as a guide for the chunking process. -->
+
+The Document Layout skill calls the [layout model](/azure/ai-services/document-intelligence/prebuilt/layout) in Document Intelligence. The model articulates content structure in JSON using Markdown syntax (headings and content), with fields for headings and content stored in a search index on Azure AI Search. The searchable content produced from the Document Layout skill is plain text but you can apply integrated vectorization to generate embeddings for any field in your source documents, including images.

 In this article, learn how to:

 > [!div class="checklist"]
-> + Use the Document Layout skill to detect sections and output Markdown content
+> + Use the Document Layout skill to recognize document structure
 > + Use the Text Split skill to constrain chunk size to each markdown section
 > + Generate embeddings for each chunk
 > + Use index projections to map embeddings to fields in a search index

+This article uses the [sample health plan PDFs](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/health-plan) uploaded to Azure Blob Storage and then indexed using the **Import and vectorize data** wizard.
+
 ## Prerequisites

-+ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output.
++ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output. The index must have fields for receiving headings and content.
 + [A supported data source](search-indexer-overview.md#supported-data-sources) having text content that you want to chunk.
 + [A skillset with Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) that splits documents based on paragraph boundaries.
 + [An Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) that generates vector embeddings.
@@ -41,7 +45,7 @@ The raw inputs must be in a [supported data source](search-indexer-overview.md#s
-+ Supported indexers can be any indexer that can handle the supported file formats. These include [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), [File indexers](search-file-storage-integration.md).
++ Supported indexers can be any indexer that can handle the supported file formats. These indexers include [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), [File indexers](search-file-storage-integration.md).

 + Supported regions for this feature include: East US, West US2, West Europe, North Central US. Be sure to [check this list](search-region-support.md#azure-public-regions) for updates on regional availability.

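The data source itself is one of the prerequisites above but isn't shown here. For the Azure Blob Storage scenario this article uses, a minimal creation request might look like the following sketch; the connection string and container name are placeholders, and `my_blob_datasource` is reused from the indexer example later in the article.

```json
{
  "name": "my_blob_datasource",
  "type": "azureblob",
  "credentials": {
    "connectionString": "<your-storage-connection-string>"
  },
  "container": {
    "name": "<your-container-name>"
  }
}
```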
@@ -52,11 +56,13 @@ You can use the Azure portal, REST APIs, or an Azure SDK package to [create a da

 ## Create an index for one-to-many indexing

-Here's an example payload of a single search document designed around chunks. In this example, parent fields are the text_parent_id. Child fields are the vector and nonvector chunks of the markdown section.
+Here's an example payload of a single search document designed around chunks. Whenever you're working with chunks, you need a chunk field and a parent field that identifies the origin of the chunk. In this example, parent fields are the text_parent_id. Child fields are the vector and nonvector chunks of the markdown section.
+
+The Document Layout skill outputs headings and content. In this example, `header_1` through `header_3` store document headings, as detected by the skill. Other content, such as paragraphs, is stored in `chunk`. The `text_vector` field is a vector representation of the chunk field content.

-You can use the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md).
+You can use the **Import and vectorize data** wizard in the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md). The following index is very similar to what the wizard creates by default. You might have more fields if you add image vectorization.

-An index must exist on the search service before you create the skill set or run the indexer.
+If you aren't using the wizard, the index must exist on the search service before you create the skillset or run the indexer.

 ```json
 {
@@ -172,11 +178,17 @@ An index must exist on the search service before you create the skill set or run
 }
 ```
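Most of the index definition above isn't reproduced here. As a rough, non-authoritative sketch of the chunk-oriented fields the surrounding prose describes (`text_parent_id`, `chunk`, `header_1` through `header_3`, and `text_vector`), the fragment below shows one possible shape. The index name, the `chunk_id` key field and its attributes, the vector dimensions, and the vector profile name are assumptions, and a complete definition also needs a matching `vectorSearch` configuration.

```json
{
  "name": "my-chunked-index",
  "fields": [
    { "name": "chunk_id", "type": "Edm.String", "key": true, "analyzer": "keyword" },
    { "name": "text_parent_id", "type": "Edm.String", "filterable": true },
    { "name": "chunk", "type": "Edm.String", "searchable": true },
    { "name": "header_1", "type": "Edm.String", "searchable": true },
    { "name": "header_2", "type": "Edm.String", "searchable": true },
    { "name": "header_3", "type": "Edm.String", "searchable": true },
    { "name": "text_vector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "my-vector-profile" }
  ]
}
```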

-## Define skill set for structure-aware chunking and vectorization
+## Define a skillset for structure-aware chunking and vectorization
+
+Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step. You can also use the Azure portal.
+
+This section shows an example of a skillset definition that projects individual markdown sections, chunks, and their vector equivalents as fields in the search index. It uses the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) to detect headings and populate a content field based on semantically coherent paragraphs and sentences in the source document. It uses the [Text Split skill](cognitive-search-skill-textsplit.md) to split the Markdown content into chunks. It uses the [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) to vectorize chunks and any other field for which you want embeddings.

-Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step.
+Besides skills, the skillset includes `indexProjections` and `cognitiveServices`:

-Here's an example skill set definition payload to project individual markdown sections chunks and their vector outputs as documents in the search index using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) and [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md)
++ `indexProjections` are used for indexes containing chunked documents. The projections specify how parent-child content is mapped to fields in a search index for one-to-many indexing. For more information, see [Define an index projection](search-how-to-define-index-projections.md).
+
++ `cognitiveServices` [attaches an Azure AI multi-service account](cognitive-search-attach-cognitive-services.md) for billing purposes (the Document Layout skill is available through [pay-as-you-go pricing](https://azure.microsoft.com/pricing/details/ai-document-intelligence/)).

 ```https
 POST {endpoint}/skillsets?api-version=2024-11-01-preview
@@ -297,7 +309,7 @@ POST {endpoint}/skillsets?api-version=2024-11-01-preview

 ```
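The skills array and most of the skillset request above aren't reproduced here. To make the two sections called out in the bullets more concrete, here's a minimal, non-authoritative sketch of how `indexProjections` and `cognitiveServices` might be shaped. The target index name, parent key field, and mappings reuse the field names from the index example, but the enrichment source paths (`/document/markdownDocument/...`) are placeholders; they must match the output names that your Document Layout and embedding skills actually define.

```json
{
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "my-chunked-index",
        "parentKeyFieldName": "text_parent_id",
        "sourceContext": "/document/markdownDocument/*",
        "mappings": [
          { "name": "chunk", "source": "/document/markdownDocument/*/content" },
          { "name": "header_1", "source": "/document/markdownDocument/*/sections/h1" },
          { "name": "text_vector", "source": "/document/markdownDocument/*/text_vector" }
        ]
      }
    ],
    "parameters": { "projectionMode": "skipIndexingParentDocuments" }
  },
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
    "key": "<your-Azure-AI-multi-service-account-key>"
  }
}
```

With `projectionMode` set to `skipIndexingParentDocuments`, only the chunk-level documents are indexed, while the parent key field still records which source document each chunk came from.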

-## Run the indexer
+## Configure and run the indexer

 Once you create a data source, index, and skillset, you're ready to [create and run the indexer](search-howto-create-indexers.md#run-the-indexer). This step puts the pipeline into execution.

@@ -306,9 +318,13 @@ When using the [Document Layout skill](cognitive-search-skill-document-intellige
 + The `allowSkillsetToReadFileData` parameter should be set to `true`.
 + The `parsingMode` parameter should be set to `default`.

-Here's an example payload
+`outputFieldMappings` don't need to be set in this scenario because `indexProjections` handle the source-field-to-search-field associations. Index projections handle field associations for the Document Layout skill, and also for regular chunking with the Text Split skill in import-and-vectorize-data workloads. Output field mappings are still necessary for transformations or complex data mappings with functions, which apply in other cases. However, for n-chunks-per-document scenarios, index projections handle this functionality natively.
+
+Here's an example of an indexer creation request.
+
+```https
+POST {endpoint}/indexers?api-version=2024-11-01-preview

-```json
 {
 "name": "my_indexer",
 "dataSourceName": "my_blob_datasource",
@@ -332,6 +348,8 @@ Here's an example payload
 }
 ```

+When you send the request to the search service, the indexer runs.
+
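The middle of the indexer definition above isn't reproduced here. As a sketch of where the two configuration settings called out earlier typically sit for a blob data source, the fragment below shows one possible shape; the `targetIndexName` and `skillsetName` values are placeholders, and `dataToExtract` is an assumption carried over from typical blob indexer configurations.

```json
{
  "name": "my_indexer",
  "dataSourceName": "my_blob_datasource",
  "targetIndexName": "my-chunked-index",
  "skillsetName": "my-skillset",
  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "allowSkillsetToReadFileData": true
    }
  }
}
```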
 ## Verify results

 You can query your search index after processing concludes to test your solution.

@@ -343,16 +361,39 @@ For Search Explorer, you can copy just the JSON and paste it into the JSON view
 ```http
 POST /indexes/[index name]/docs/search?api-version=[api-version]
 If you used the health plan PDFs to test this skill, search results for the example query should look similar to this example.
+
++ It uses semantic ranking, so you see `captions` (it also has answers, but those aren't shown in the screenshot). The first result is also semantically relevant to the query string.
++ It's a [hybrid query](hybrid-search-how-to-query.md) over text and vectors, so you see a `@search.rerankerScore` and results are ranked by that score.
++ The `select` statement specifies the header fields that the Document Layout skill detected and populated.
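The query body itself is truncated above. A minimal, illustrative sketch of a hybrid query with semantic ranking that selects the header fields might look like the following; the query text, semantic configuration name, and `k`/`top` values are assumptions rather than the article's exact example.

```json
{
  "search": "What's the deductible for in-network providers?",
  "vectorQueries": [
    {
      "kind": "text",
      "text": "What's the deductible for in-network providers?",
      "fields": "text_vector",
      "k": 5
    }
  ],
  "queryType": "semantic",
  "semanticConfiguration": "my-semantic-config",
  "captions": "extractive",
  "answers": "extractive",
  "select": "chunk, header_1, header_2, header_3",
  "top": 5
}
```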
+
+:::image type="content" source="media/search-how-to-semantic-chunking/query-results-doc-layout.png" alt-text="Screenshot of query results that include doc layout skill output fields.":::
+
 ## See also

 + [Create or update a skill set](cognitive-search-defining-skillset.md).
 + [Create a data source](search-howto-indexing-azure-blob-storage.md)
 + [Define an index projection](search-how-to-define-index-projections.md)
++ [Attach an Azure AI multi-service account](cognitive-search-attach-cognitive-services.md)