articles/search/search-how-to-semantic-chunking.md

---
title: Chunk and vectorize by document layout
titleSuffix: Azure AI Search
description: Chunk textual content by headings and semantically coherent fragments, generate embeddings, and send the results to a searchable index.
author: rawan
ms.author: rawan
ms.service: azure-ai-search
ms.topic: how-to
ms.date: 11/22/2024
ms.custom:
  - references_regions
  - ignite-2024
---

# Chunk and vectorize by document layout or structure

Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new **Document Layout** skill, currently in preview, you can chunk content based on document structure, capturing headings and chunking the content body based on semantic coherence, such as paragraphs and sentences. Chunks are processed independently. Because LLMs work with multiple chunks, when those chunks are of higher quality and semantically coherent, the overall relevance of the query improves.

The Document Layout skill calls the [layout model](/azure/ai-services/document-intelligence/prebuilt/layout) in Document Intelligence. The model articulates content structure in JSON using Markdown syntax (headings and content), with fields for headings and content stored in a search index on Azure AI Search. The searchable content produced by the Document Layout skill is plain text, but you can apply integrated vectorization to generate embeddings for any field in your source documents, including images.
In this article, learn how to:

> [!div class="checklist"]
> + Use the Document Layout skill to recognize document structure
> + Use the Text Split skill to constrain chunk size to each markdown section
> + Generate embeddings for each chunk
> + Use index projections to map embeddings to fields in a search index

For illustration purposes, this article uses the [sample health plan PDFs](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/health-plan) uploaded to Azure Blob Storage and then indexed using the **Import and vectorize data** wizard.

## Prerequisites

+ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output. The index must have fields for receiving headings and content.

+ [A supported data source](search-indexer-overview.md#supported-data-sources) with text content that you want to chunk.

+ [A skillset with the Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) that splits documents based on paragraph boundaries.

+ [An Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) that generates vector embeddings.

+ [An index projection](search-how-to-define-index-projections.md) for one-to-many indexing.

## Prepare data files

The raw inputs must be in a [supported data source](search-indexer-overview.md#supported-data-sources).

+ Supported indexers can be any indexer that can handle the supported file formats. These indexers include [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), and [File indexers](search-file-storage-integration.md).

+ Supported regions for this feature include East US, West US2, West Europe, and North Central US. Be sure to [check this list](search-region-support.md#azure-public-regions) for updates on regional availability.

You can use the Azure portal, REST APIs, or an Azure SDK package to create a data source.

## Create an index for one-to-many indexing

Here's an example payload of a single search document designed around chunks. Whenever you're working with chunks, you need a chunk field and a parent field that identifies the chunk's origin. In this example, the parent field is `text_parent_id`. Child fields are the vector and nonvector chunks of the markdown section.

The Document Layout skill outputs headings and content. In this example, `header_1` through `header_3` store document headings, as detected by the skill. Other content, such as paragraphs, is stored in `chunk`. The `text_vector` field is a vector representation of the `chunk` field content.

You can use the **Import and vectorize data** wizard in the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md). The following index is very similar to what the wizard creates by default. You might have more fields if you add image vectorization.

If you aren't using the wizard, the index must exist on the search service before you create the skillset or run the indexer.

```json
{
  ...
}
```
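
The full field collection is omitted from the previous payload. As a rough, hypothetical sketch of the chunk-level fields described earlier (the key field name, analyzer, dimensions, and vector profile name here are assumptions, not values taken from this article), the fields collection might look like this:

```json
"fields": [
  { "name": "chunk_id", "type": "Edm.String", "key": true, "analyzer": "keyword", "searchable": true },
  { "name": "text_parent_id", "type": "Edm.String", "filterable": true },
  { "name": "header_1", "type": "Edm.String", "searchable": true },
  { "name": "header_2", "type": "Edm.String", "searchable": true },
  { "name": "header_3", "type": "Edm.String", "searchable": true },
  { "name": "chunk", "type": "Edm.String", "searchable": true },
  { "name": "text_vector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "my-vector-profile" }
]
```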

## Define a skillset for structure-aware chunking and vectorization

Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step. You can also use the Azure portal.

This section shows an example of a skillset definition that projects individual markdown sections, chunks, and their vector equivalents as fields in the search index. It uses the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) to detect headings and populate a content field based on semantically coherent paragraphs and sentences in the source document. It uses the [Text Split skill](cognitive-search-skill-textsplit.md) to split the Markdown content into chunks. It uses the [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) to vectorize chunks and any other field for which you want embeddings.

Besides skills, the skillset includes `indexProjections` and `cognitiveServices`:

+ `indexProjections` are used for indexes containing chunked documents. The projections specify how parent-child content is mapped to fields in a search index for one-to-many indexing. For more information, see [Define an index projection](search-how-to-define-index-projections.md).

+ `cognitiveServices` [attaches an Azure AI multi-service account](cognitive-search-attach-cognitive-services.md) for billing purposes. The Document Layout skill is available through [pay-as-you-go pricing](https://azure.microsoft.com/pricing/details/ai-document-intelligence/).

```https
POST {endpoint}/skillsets?api-version=2024-11-01-preview

...
```
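
The rest of the skillset body (the skills array, `indexProjections`, and `cognitiveServices`) isn't reproduced here. As a rough structural sketch of the `indexProjections` section only, with placeholder values for anything not named earlier in this article (the target index name, source context, and source paths are assumptions you would replace with your own), the shape looks like this:

```json
"indexProjections": {
  "selectors": [
    {
      "targetIndexName": "<your chunked index>",
      "parentKeyFieldName": "text_parent_id",
      "sourceContext": "<path to the per-chunk node produced by the skills>",
      "mappings": [
        { "name": "header_1", "source": "<path to the h1 heading output>" },
        { "name": "header_2", "source": "<path to the h2 heading output>" },
        { "name": "header_3", "source": "<path to the h3 heading output>" },
        { "name": "chunk", "source": "<path to the chunk content>" },
        { "name": "text_vector", "source": "<path to the embedding output>" }
      ]
    }
  ],
  "parameters": {
    "projectionMode": "skipIndexingParentDocuments"
  }
}
```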

## Configure and run the indexer

Once you create a data source, index, and skillset, you're ready to [create and run the indexer](search-howto-create-indexers.md#run-the-indexer). This step puts the pipeline into execution.

When using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md), set the following indexer parameters:

+ The `allowSkillsetToReadFileData` parameter should be set to `true`.
+ The `parsingMode` parameter should be set to `default`.
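
These settings go in the indexer definition's `parameters.configuration` section. Here's a minimal sketch showing only the two settings called out above; any other configuration values are omitted:

```json
"parameters": {
  "configuration": {
    "parsingMode": "default",
    "allowSkillsetToReadFileData": true
  }
}
```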

You don't need to set `outputFieldMappings` in this scenario because `indexProjections` handle the source-field-to-search-field associations. Index projections handle the field associations for the Document Layout skill, and also for regular chunking with the Text Split skill in import-and-vectorize workloads. Output field mappings are still necessary for transformations or complex data mappings with functions, which apply in other cases. For n-chunks-per-document scenarios, however, index projections handle this functionality natively.

Here's an example of an indexer creation request.

```https
POST {endpoint}/indexers?api-version=2024-11-01-preview

{
  "name": "my_indexer",
  "dataSourceName": "my_blob_datasource",
  ...
}
```

When you send the request to the search service, the indexer runs.
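
Optionally, you can check progress while the indexer runs. A minimal status check, assuming the indexer name used in the previous example, looks like this:

```https
GET {endpoint}/indexers/my_indexer/search.status?api-version=2024-11-01-preview
```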

## Verify results

You can query your search index after processing concludes to test your solution.

For Search Explorer, you can copy just the JSON and paste it into the JSON view.

```http
POST /indexes/[index name]/docs/search?api-version=[api-version]

...
```

If you used the health plan PDFs to test this skill, Search Explorer results for the example query should look similar to the results in the following screenshot.

+ The query is a [hybrid query](hybrid-search-how-to-query.md) over text and vectors, so you see a `@search.rerankerScore` and results are ranked by that score. `searchMode=all` means that *all* query terms must be considered for a match (the default is *any*).

+ The query uses semantic ranking, so you see `captions` (it also has `answers`, but those aren't shown in the screenshot). The results are the most semantically relevant to the query input, as determined by the [semantic ranker](semantic-search-overview.md).

+ The `select` statement (not shown in the screenshot) specifies the header fields that the Document Layout skill detects and populates. You can add more fields to the `select` clause to inspect the content of chunks, the title, or any other human-readable field.
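
The query body isn't shown above. A rough sketch of a hybrid semantic query with these characteristics might look like the following; the semantic configuration name and search text are assumptions, and query-time vectorization of `text_vector` assumes the index has an integrated vectorizer assigned:

```http
POST /indexes/[index name]/docs/search?api-version=[api-version]

{
  "search": "What copays are covered by the Northwind Health Plus plan?",
  "searchMode": "all",
  "queryType": "semantic",
  "semanticConfiguration": "my-semantic-config",
  "answers": "extractive",
  "captions": "extractive",
  "vectorQueries": [
    {
      "kind": "text",
      "text": "What copays are covered by the Northwind Health Plus plan?",
      "fields": "text_vector"
    }
  ],
  "select": "header_1, header_2, header_3, chunk",
  "top": 5
}
```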

:::image type="content" source="media/search-how-to-semantic-chunking/query-results-doc-layout.png" lightbox="media/search-how-to-semantic-chunking/query-results-doc-layout.png" alt-text="Screenshot of hybrid query results that include doc layout skill output fields.":::

## See also

+ [Create or update a skillset](cognitive-search-defining-skillset.md)
+ [Create a data source](search-howto-indexing-azure-blob-storage.md)
+ [Define an index projection](search-how-to-define-index-projections.md)
+ [Attach an Azure AI multi-service account](cognitive-search-attach-cognitive-services.md)