Commit 6cfcaea

Screenshot, examples, clarification on structure-aware chunking and doc layout
1 parent d5d892d commit 6cfcaea

3 files changed

+63
-22
lines changed

articles/search/search-how-to-semantic-chunking.md

Lines changed: 61 additions & 20 deletions
@@ -1,35 +1,39 @@
11
---
2-
title: Structure-aware chunking and vectorization
2+
title: Chunk and vectorize by document layout
33
titleSuffix: Azure AI Search
4-
description: Chunk text content by paragraph or semantically coherent fragment. You can then apply integrated vectorization to generate embeddings and send the results to a searchable index.
4+
description: Chunk textual content by headings and semantically coherent fragments, generate embeddings, and send the results to a searchable index.
55
author: rawan
66
ms.author: rawan
77
ms.service: azure-ai-search
88
ms.topic: how-to
9-
ms.date: 11/19/2024
9+
ms.date: 11/22/2024
1010
ms.custom:
1111
- references_regions
1212
---
1313

14-
# Structure-aware chunking and vectorization in Azure AI Search
14+
# Chunk and vectorize by document layout or structure
1515

1616
[!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]
1717

18-
Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new Document Layout skill that's currently in preview, you can chunk content based on paragraphs or semantically coherent fragments of a sentence representation. These fragments can then be processed independently and recombined as semantic representations without loss of information, interpretation, or semantic relevance. The inherent meaning of the text is used as a guide for the chunking process.
18+
Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new **Document Layout** skill that's currently in preview, you can chunk content based on document structure: the skill captures headings and chunks the content body along semantically coherent units, such as paragraphs and sentences. Chunks are processed independently. Because LLMs work with multiple chunks, higher-quality, semantically coherent chunks improve the overall relevance of query responses.
1919

20-
The Document Layout skill uses Markdown syntax (headings and content) to articulate document structure in the search document. The searchable content obtained from your source document is plain text but you can add integrated vectorization to generate embeddings for any field.
20+
<!-- Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new **Document Layout** skill that's currently in preview, you can chunk content based on document structure, capturing headings and chunking the content body based on semantic coherence, such as paragraphs and sentences. Chunks are processed independently and recombined as semantic representations. The inherent meaning of the text is used as a guide for the chunking process. -->
21+
22+
The Document Layout skill calls the [layout model](/azure/ai-services/document-intelligence/prebuilt/layout) in Document Intelligence. The model articulates content structure in JSON using Markdown syntax (headings and content), and the headings and content are stored in separate fields in a search index on Azure AI Search. The searchable content produced by the Document Layout skill is plain text, but you can apply integrated vectorization to generate embeddings for any field in your source documents, including images.
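To make the idea of heading-aware chunking concrete, here's a minimal Python sketch. It isn't the Document Layout skill's algorithm; it's only an illustration under stated assumptions: the input is Markdown with blank lines between blocks, only heading levels 1 through 3 are tracked, and `chunk_by_headings` and `max_chars` are hypothetical names invented for this example.

```python
import re

# Matches Markdown headings of level 1-3, for example "## In-network".
HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def chunk_by_headings(markdown: str, max_chars: int = 200) -> list[dict]:
    """Group paragraphs into chunks that never cross a heading boundary.

    Illustration only: assumes blocks are separated by blank lines.
    """
    headers = {"header_1": "", "header_2": "", "header_3": ""}
    chunks: list[dict] = []
    current = ""

    def flush() -> None:
        # Emit the pending text together with the headings in effect.
        nonlocal current
        if current:
            chunks.append({**headers, "chunk": current})
            current = ""

    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = " ".join(block.split())  # normalize whitespace
        match = HEADING.match(block)
        if match:
            flush()
            level = len(match.group(1))
            headers[f"header_{level}"] = match.group(2)
            # A new heading resets every deeper heading level.
            for deeper in range(level + 1, 4):
                headers[f"header_{deeper}"] = ""
        elif current and len(current) + 1 + len(block) > max_chars:
            # Adding this paragraph would exceed the size limit.
            flush()
            current = block
        else:
            current = f"{current} {block}".strip()
    flush()
    return chunks


sample = """# Benefits

Copays apply to office visits.

## In-network

In-network providers have lower copays.

Referrals are not required.
"""

chunks = chunk_by_headings(sample, max_chars=60)
for item in chunks:
    print(item)
```

Each chunk carries the headings that were in effect when it was produced, which mirrors the `header_1` through `header_3` and `chunk` fields used later in this article.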
2123

2224
In this article, learn how to:
2325

2426
> [!div class="checklist"]
25-
> + Use the Document Layout skill to detect sections and output Markdown content
27+
> + Use the Document Layout skill to recognize document structure
2628
> + Use the Text Split skill to constrain chunk size to each Markdown section
2729
> + Generate embeddings for each chunk
2830
> + Use index projections to map embeddings to fields in a search index
2931
32+
This article uses the [sample health plan PDFs](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/health-plan) uploaded to Azure Blob Storage and then indexed using the **Import and vectorize data** wizard.
33+
3034
## Prerequisites
3135

32-
+ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output.
36+
+ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output. The index must have fields for receiving headings and content.
3337
+ [A supported data source](search-indexer-overview.md#supported-data-sources) having text content that you want to chunk.
3438
+ [A skillset with Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) that splits documents based on paragraph boundaries.
3539
+ [An Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) that generates vector embeddings.
@@ -41,7 +45,7 @@ The raw inputs must be in a [supported data source](search-indexer-overview.md#s
4145

4246
+ Supported file formats include: PDF, JPEG, JPG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.
4347

44-
+ Supported indexers can be any indexer that can handle the supported file formats. These include [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), [File indexers](search-file-storage-integration.md).
48+
+ Supported indexers can be any indexer that can handle the supported file formats. These indexers include [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), [File indexers](search-file-storage-integration.md).
4549

4650
+ Supported regions for this feature include: East US, West US2, West Europe, North Central US. Be sure to [check this list](search-region-support.md#azure-public-regions) for updates on regional availability.
4751

@@ -52,11 +56,13 @@ You can use the Azure portal, REST APIs, or an Azure SDK package to [create a da
5256
5357
## Create an index for one-to-many indexing
5458

55-
Here's an example payload of a single search document designed around chunks. In this example, parent fields are the text_parent_id. Child fields are the vector and nonvector chunks of the markdown section.
59+
Here's an example payload of a single search document designed around chunks. Whenever you're working with chunks, you need a chunk field and a parent field that identifies the origin of the chunk. In this example, the parent field is `text_parent_id`. Child fields are the vector and nonvector chunks of the Markdown section.
60+
61+
The Document Layout skill outputs headings and content. In this example, `header_1` through `header_3` store document headings, as detected by the skill. Other content, such as paragraphs, is stored in `chunk`. The `text_vector` field is a vector representation of the chunk field content.
5662

57-
You can use the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md).
63+
You can use the **Import and vectorize data** wizard in the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md). The following index is very similar to what the wizard creates by default. You might have more fields if you add image vectorization.
5864

59-
An index must exist on the search service before you create the skill set or run the indexer.
65+
If you aren't using the wizard, the index must exist on the search service before you create the skillset or run the indexer.
6066

6167
```json
6268
{
@@ -172,11 +178,17 @@ An index must exist on the search service before you create the skill set or run
172178
}
173179
```
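Before creating the skillset, it can help to confirm that an index definition contains the fields this article relies on. The following Python sketch checks a definition for `text_parent_id`, `chunk`, `text_vector`, and `header_1` through `header_3`. The `chunk_id` key field, the `1536` dimension value, and the `missing_fields` helper are assumptions made for illustration; adjust them to match your own index and embedding model.

```python
# Field names taken from this article's index description.
REQUIRED_FIELDS = {
    "text_parent_id", "chunk", "text_vector",
    "header_1", "header_2", "header_3",
}


def missing_fields(index_definition: dict) -> list[str]:
    """Return the required field names absent from an index definition."""
    present = {field["name"] for field in index_definition.get("fields", [])}
    return sorted(REQUIRED_FIELDS - present)


# Hypothetical, abbreviated index definition for illustration only.
sample_index = {
    "name": "my_index",
    "fields": [
        {"name": "chunk_id", "type": "Edm.String", "key": True},  # assumed key field
        {"name": "text_parent_id", "type": "Edm.String", "filterable": True},
        {"name": "chunk", "type": "Edm.String", "searchable": True},
        {"name": "text_vector", "type": "Collection(Edm.Single)",
         "dimensions": 1536},  # assumed embedding size
        {"name": "header_1", "type": "Edm.String", "searchable": True},
        {"name": "header_2", "type": "Edm.String", "searchable": True},
        {"name": "header_3", "type": "Edm.String", "searchable": True},
    ],
}

print(missing_fields(sample_index))
```

An empty list means every field named in this article is present. The check looks at names only, not at types or attributes.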
174180

175-
## Define skill set for structure-aware chunking and vectorization
181+
## Define a skillset for structure-aware chunking and vectorization
182+
183+
Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step. You can also use the Azure portal.
184+
185+
This section shows an example of a skillset definition that projects individual markdown sections, chunks, and their vector equivalents as fields in the search index. It uses the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) to detect headings and populate a content field based on semantically coherent paragraphs and sentences in the source document. It uses the [Text Split skill](cognitive-search-skill-textsplit.md) to split the Markdown content into chunks. It uses the [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) to vectorize chunks and any other field for which you want embeddings.
176186

177-
Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step.
187+
Besides skills, the skillset includes `indexProjections` and `cognitiveServices`:
178188

179-
Here's an example skill set definition payload to project individual markdown sections chunks and their vector outputs as documents in the search index using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) and [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md)
189+
+ `indexProjections` are used for indexes containing chunked documents. The projections specify how parent-child content is mapped to fields in a search index for one-to-many indexing. For more information, see [Define an index projection](search-how-to-define-index-projections.md).
190+
191+
+ `cognitiveServices` [attaches an Azure AI multi-service account](cognitive-search-attach-cognitive-services.md) for billing purposes (the Document Layout skill is available through [pay-as-you-go pricing](https://azure.microsoft.com/pricing/details/ai-document-intelligence/)).
180192

181193
```https
182194
POST {endpoint}/skillsets?api-version=2024-11-01-preview
@@ -297,7 +309,7 @@ POST {endpoint}/skillsets?api-version=2024-11-01-preview
297309
298310
```
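As a rough illustration of how `indexProjections` ties parent and child content together, the following Python sketch assembles that fragment of a skillset. Treat it as a sketch under assumptions, not the skillset shown above: the enrichment paths (`/document/markdownDocument/...`), the index name, and the `build_index_projections` helper are hypothetical and depend on the output names used in your own skills.

```python
import json


def build_index_projections(target_index: str, source_context: str,
                            field_sources: dict[str, str]) -> dict:
    """Build an indexProjections fragment for one-to-many indexing."""
    return {
        "selectors": [
            {
                "targetIndexName": target_index,
                # Parent field that identifies the origin of each chunk.
                "parentKeyFieldName": "text_parent_id",
                "sourceContext": source_context,
                "mappings": [
                    {"name": name, "source": source}
                    for name, source in field_sources.items()
                ],
            }
        ],
        # Index only the chunks, not the parent documents.
        "parameters": {"projectionMode": "skipIndexingParentDocuments"},
    }


# Assumed enrichment paths; yours depend on your skill output names.
pages = "/document/markdownDocument/*/pages/*"
projections = build_index_projections(
    target_index="my_index",
    source_context=pages,
    field_sources={
        "chunk": pages,
        "text_vector": f"{pages}/text_vector",
        "header_1": "/document/markdownDocument/*/sections/h1",
        "header_2": "/document/markdownDocument/*/sections/h2",
        "header_3": "/document/markdownDocument/*/sections/h3",
    },
)

print(json.dumps(projections, indent=2))
```

Each mapping pairs a field in the search index with a node in the enriched document, so one source document can produce many chunk documents that share the same `text_parent_id`.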
299311

300-
## Run the indexer
312+
## Configure and run the indexer
301313

302314
Once you create a data source, index, and skillset, you're ready to [create and run the indexer](search-howto-create-indexers.md#run-the-indexer). This step puts the pipeline into execution.
303315

@@ -306,9 +318,13 @@ When using the [Document Layout skill](cognitive-search-skill-document-intellige
306318
+ The `allowSkillsetToReadFileData` parameter should be set to `true`.
307319
+ The `parsingMode` parameter should be set to `default`.
308320

309-
Here's an example payload
321+
`outputFieldMappings` don't need to be set in this scenario because `indexProjections` handle the associations between source fields and search fields. Index projections handle field associations both for the Document Layout skill and for regular chunking with the Text Split skill in imported and vectorized data workloads. Output field mappings are still necessary in other cases, such as transformations or complex data mappings that use functions. For n chunks per document, however, index projections provide this functionality natively.
322+
323+
Here's an example of an indexer creation request.
324+
325+
```https
326+
POST {endpoint}/indexers?api-version=2024-11-01-preview
310327
311-
```json
312328
{
313329
"name": "my_indexer",
314330
"dataSourceName": "my_blob_datasource",
@@ -332,6 +348,8 @@ Here's an example payload
332348
}
333349
```
334350

351+
When you send the request to the search service, the indexer runs.
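If you prefer to script the request, the following Python sketch builds the same kind of indexer creation request with only the standard library. It constructs the request but doesn't send it. The endpoint, API key, index name, skillset name, and the `build_indexer_request` helper are placeholders assumed for illustration; the two configuration parameters are the ones called out earlier in this section.

```python
import json
from urllib.request import Request


def build_indexer_request(endpoint: str, api_key: str, name: str,
                          data_source: str, index: str, skillset: str) -> Request:
    """Build (but don't send) a POST request that creates an indexer."""
    body = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                # Settings this article calls for with the Document Layout skill.
                "allowSkillsetToReadFileData": True,
                "parsingMode": "default",
            }
        },
    }
    return Request(
        url=f"{endpoint}/indexers?api-version=2024-11-01-preview",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Placeholder values; replace with your own service details.
request = build_indexer_request(
    endpoint="https://example.search.windows.net",
    api_key="<your-admin-api-key>",
    name="my_indexer",
    data_source="my_blob_datasource",
    index="my_index",        # hypothetical index name
    skillset="my_skillset",  # hypothetical skillset name
)

print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen`. Building it separately makes it easy to inspect the payload first.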
352+
335353
## Verify results
336354

337355
You can query your search index after processing concludes to test your solution.
@@ -343,16 +361,39 @@ For Search Explorer, you can copy just the JSON and paste it into the JSON view
343361
```http
344362
POST /indexes/[index name]/docs/search?api-version=[api-version]
345363
{
346-
"search": "*",
347-
"select": "metadata_storage_path, markdown_section, vector"
364+
"search": "copay for in-network providers",
365+
"count": true,
366+
"searchMode": "all",
367+
"vectorQueries": [
368+
{
369+
"kind": "text",
370+
"text": "*",
371+
"fields": "text_vector,image_vector"
372+
}
373+
],
374+
"queryType": "semantic",
375+
"semanticConfiguration": "healthplan-doc-layout-test-semantic-configuration",
376+
"captions": "extractive",
377+
"answers": "extractive|count-3",
378+
"queryLanguage": "en-us",
379+
"select": "header_1, header_2, header_3"
348380
}
349381
```
350382

383+
If you used the health plan PDFs to test this skill, search results for the example query should look similar to this example.
384+
385+
+ It uses semantic ranking, so you see `captions` (it also has answers, but those aren't shown in the screenshot). The first result is also semantically relevant to the query string.
386+
+ It's a [hybrid query](hybrid-search-how-to-query.md) over text and vectors. Because semantic ranking is applied to the merged results, you see a `@search.rerankerScore` and results are ranked by that score.
387+
+ The `select` statement specifies the header fields that the Document Layout skill detected and populated.
388+
389+
:::image type="content" source="media/search-how-to-semantic-chunking/query-results-doc-layout.png" alt-text="Screenshot of query results that include doc layout skill output fields.":::
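To inspect the same output programmatically, the following Python sketch summarizes a search response by heading path, reranker score, and first caption. The sample response is made-up data shaped like a semantic query response, and `summarize` is a hypothetical helper, so treat the values as placeholders rather than real results.

```python
def summarize(response: dict) -> list[str]:
    """Return one line per result: reranker score, heading path, first caption."""
    lines = []
    for doc in response.get("value", []):
        # Join only the heading levels that the skill populated.
        levels = (doc.get("header_1"), doc.get("header_2"), doc.get("header_3"))
        path = " > ".join(level for level in levels if level)
        score = doc.get("@search.rerankerScore")
        score_text = f"{score:.2f}" if score is not None else "n/a"
        captions = doc.get("@search.captions") or []
        caption = captions[0].get("text", "") if captions else ""
        lines.append(f"{score_text} | {path} | {caption}")
    return lines


# Made-up response for illustration; not actual query output.
sample_response = {
    "@odata.count": 1,
    "value": [
        {
            "@search.score": 0.03,
            "@search.rerankerScore": 3.5,
            "@search.captions": [{"text": "Copays are lower in network."}],
            "header_1": "Benefits",
            "header_2": "In-network",
            "header_3": "",
        }
    ],
}

for line in summarize(sample_response):
    print(line)
```

Empty heading levels are skipped, so a chunk found directly under a top-level heading shows only that heading in its path.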
390+
351391
## See also
352392

353393
+ [Create or update a skillset](cognitive-search-defining-skillset.md)
354394
+ [Create a data source](search-howto-indexing-azure-blob-storage.md)
355395
+ [Define an index projection](search-how-to-define-index-projections.md)
396+
+ [Attach an Azure AI multi-service account](cognitive-search-attach-cognitive-services.md)
356397
+ [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md)
357398
+ [Text Split skill](cognitive-search-skill-textsplit.md)
358399
+ [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md)

articles/search/toc.yml

Lines changed: 2 additions & 2 deletions
@@ -321,8 +321,6 @@ items:
321321
href: cognitive-search-concept-image-scenarios.md
322322
- name: Cache (incremental) enrichment
323323
href: search-howto-incremental-index.md
324-
- name: Structure-aware chunking and vectorization
325-
href: search-how-to-semantic-chunking.md
326324
- name: Design tips
327325
href: cognitive-search-concept-troubleshooting.md
328326
- name: Custom skills
@@ -339,6 +337,8 @@ items:
339337
href: vector-search-how-to-create-index.md
340338
- name: Chunk documents
341339
href: vector-search-how-to-chunk-documents.md
340+
- name: Chunk and vectorize by document layout
341+
href: search-how-to-semantic-chunking.md
342342
- name: Generate embeddings
343343
href: vector-search-how-to-generate-embeddings.md
344344
- name: Use embedding models from Azure AI Studio