
Commit 2e2200f

Merge pull request #1698 from HeidiSteen/main
[azure search] Screenshot, examples, clarification on structure-aware chunking and d…
2 parents 2702d23 + 8a77465 commit 2e2200f

File tree: 3 files changed, +69 -22 lines

183 KB binary image file (screenshot; preview not rendered in the diff)

articles/search/search-how-to-semantic-chunking.md

Lines changed: 67 additions & 20 deletions
@@ -1,39 +1,47 @@
 ---
-title: Structure-aware chunking and vectorization
+title: Chunk and vectorize by document layout
 titleSuffix: Azure AI Search
-description: Chunk text content by paragraph or semantically coherent fragment. You can then apply integrated vectorization to generate embeddings and send the results to a searchable index.
+description: Chunk textual content by headings and semantically coherent fragments, generate embeddings, and send the results to a searchable index.
 author: rawan
 ms.author: rawan
 ms.service: azure-ai-search
 ms.topic: how-to
-ms.date: 11/19/2024
+ms.date: 11/22/2024
 ms.custom:
   - references_regions
   - ignite-2024
 ---
 
-# Structure-aware chunking and vectorization in Azure AI Search
+# Chunk and vectorize by document layout or structure
 
 [!INCLUDE [Feature preview](./includes/previews/preview-generic.md)]
 
-Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new Document Layout skill that's currently in preview, you can chunk content based on paragraphs or semantically coherent fragments of a sentence representation. These fragments can then be processed independently and recombined as semantic representations without loss of information, interpretation, or semantic relevance. The inherent meaning of the text is used as a guide for the chunking process.
+Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new **Document Layout** skill that's currently in preview, you can chunk content based on document structure, capturing headings and chunking the content body based on semantic coherence, such as paragraphs and sentences. Chunks are processed independently. Because LLMs work with multiple chunks, when those chunks are of higher quality and semantically coherent, the overall relevance of the response is improved.
 
-The Document Layout skill uses Markdown syntax (headings and content) to articulate document structure in the search document. The searchable content obtained from your source document is plain text but you can add integrated vectorization to generate embeddings for any field.
+<!-- Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new **Document Layout** skill that's currently in preview, you can chunk content based on document structure, capturing headings and chunking the content body based on semantic coherence, such as paragraphs and sentences. Chunks are processed independently and recombined as semantic representations. The inherent meaning of the text is used as a guide for the chunking process. -->
+
+The Document Layout skill calls the [layout model](/azure/ai-services/document-intelligence/prebuilt/layout) in Document Intelligence. The model articulates content structure in JSON using Markdown syntax (headings and content), with fields for headings and content stored in a search index on Azure AI Search. The searchable content produced from the Document Layout skill is plain text, but you can apply integrated vectorization to generate embeddings for any field in your source documents, including images.
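To make that concrete, here's an illustrative sketch (not part of this commit) of one chunked search document that such a pipeline might produce. The field names `text_parent_id`, `header_1` through `header_3`, `chunk`, and `text_vector` come from the article's own example; the key field name `chunk_id`, the values, and the three-element vector (truncated for readability) are placeholders.

```json
{
  "chunk_id": "aHR0cHM6...x3BsYW4x_chunk_0",
  "text_parent_id": "aHR0cHM6...x3BsYW4x",
  "header_1": "Northwind Health Plus",
  "header_2": "Covered services",
  "header_3": "Office visits",
  "chunk": "In-network office visits are covered after a copay. Out-of-network visits are subject to the deductible...",
  "text_vector": [0.0132, -0.0274, 0.0519]
}
```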
 
 In this article, learn how to:
 
 > [!div class="checklist"]
-> + Use the Document Layout skill to detect sections and output Markdown content
+> + Use the Document Layout skill to recognize document structure
 > + Use the Text Split skill to constrain chunk size to each markdown section
 > + Generate embeddings for each chunk
 > + Use index projections to map embeddings to fields in a search index
 
+For illustration purposes, this article uses the [sample health plan PDFs](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/health-plan) uploaded to Azure Blob Storage and then indexed using the **Import and vectorize data wizard**.
+
 ## Prerequisites
 
-+ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output.
++ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output. The index must have fields for receiving headings and content.
+
 + [A supported data source](search-indexer-overview.md#supported-data-sources) having text content that you want to chunk.
+
 + [A skillset with Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) that splits documents based on paragraph boundaries.
+
 + [An Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) that generates vector embeddings.
+
 + [An index projection](search-how-to-define-index-projections.md) for one-to-many indexing.
 
 ## Prepare data files
@@ -42,7 +50,7 @@ The raw inputs must be in a [supported data source](search-indexer-overview.md#s
 
 + Supported file formats include: PDF, JPEG, JPG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.
 
-+ Supported indexers can be any indexer that can handle the supported file formats. These include [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), [File indexers](search-file-storage-integration.md).
++ Supported indexers can be any indexer that can handle the supported file formats. These indexers include [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), and [File indexers](search-file-storage-integration.md).
 
 + Supported regions for this feature include: East US, West US2, West Europe, North Central US. Be sure to [check this list](search-region-support.md#azure-public-regions) for updates on regional availability.
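The next hunk's context line notes that you can use the Azure portal, REST APIs, or an Azure SDK package to create the data source. For reference, here's a hedged REST sketch of a blob data source definition; the connection string and container name are placeholders, and `my_blob_datasource` simply reuses the name that appears in the indexer example later in the diff.

```http
POST {endpoint}/datasources?api-version=2024-11-01-preview

{
  "name": "my_blob_datasource",
  "type": "azureblob",
  "credentials": { "connectionString": "<your-storage-connection-string>" },
  "container": { "name": "<container-holding-the-health-plan-pdfs>" }
}
```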

@@ -53,11 +61,13 @@ You can use the Azure portal, REST APIs, or an Azure SDK package to [create a da
 
 ## Create an index for one-to-many indexing
 
-Here's an example payload of a single search document designed around chunks. In this example, parent fields are the text_parent_id. Child fields are the vector and nonvector chunks of the markdown section.
+Here's an example payload of a single search document designed around chunks. Whenever you're working with chunks, you need a chunk field and a parent field that identifies the origin of the chunk. In this example, the parent field is `text_parent_id`. Child fields are the vector and nonvector chunks of the markdown section.
 
-You can use the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md).
+The Document Layout skill outputs headings and content. In this example, `header_1` through `header_3` store document headings, as detected by the skill. Other content, such as paragraphs, is stored in `chunk`. The `text_vector` field is a vector representation of the chunk field content.
 
-An index must exist on the search service before you create the skill set or run the indexer.
+You can use the **Import and vectorize data** wizard in the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md). The following index is very similar to what the wizard creates by default. You might have more fields if you add image vectorization.
+
+If you aren't using the wizard, the index must exist on the search service before you create the skillset or run the indexer.
 
 ```json
 {
@@ -173,11 +183,17 @@ An index must exist on the search service before you create the skill set or run
 }
 ```
 
-## Define skill set for structure-aware chunking and vectorization
+## Define a skillset for structure-aware chunking and vectorization
+
+Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step. You can also use the Azure portal.
+
+This section shows an example of a skillset definition that projects individual markdown sections, chunks, and their vector equivalents as fields in the search index. It uses the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) to detect headings and populate a content field based on semantically coherent paragraphs and sentences in the source document. It uses the [Text Split skill](cognitive-search-skill-textsplit.md) to split the Markdown content into chunks. It uses the [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) to vectorize chunks and any other field for which you want embeddings.
 
-Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step.
+Besides skills, the skillset includes `indexProjections` and `cognitiveServices`:
 
-Here's an example skill set definition payload to project individual markdown sections chunks and their vector outputs as documents in the search index using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) and [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md)
++ `indexProjections` are used for indexes containing chunked documents. The projections specify how parent-child content is mapped to fields in a search index for one-to-many indexing. For more information, see [Define an index projection](search-how-to-define-index-projections.md).
+
++ `cognitiveServices` [attaches an Azure AI multi-service account](cognitive-search-attach-cognitive-services.md) for billing purposes (the Document Layout skill is available through [pay-as-you-go pricing](https://azure.microsoft.com/pricing/details/ai-document-intelligence/)).
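The article's full skillset example begins at the `POST` request that follows and is mostly elided in this diff. Purely for orientation, here's a generic fragment showing the shape of an `indexProjections` section as described in the bullets above; the `sourceContext` and `source` paths, the target index name, and the projection mode are illustrative assumptions, not values taken from this commit.

```json
"indexProjections": {
  "selectors": [
    {
      "targetIndexName": "my-chunked-index",
      "parentKeyFieldName": "text_parent_id",
      "sourceContext": "/document/markdownDocument/*/pages/*",
      "mappings": [
        { "name": "text_vector", "source": "/document/markdownDocument/*/pages/*/text_vector" },
        { "name": "chunk", "source": "/document/markdownDocument/*/pages/*" },
        { "name": "header_1", "source": "/document/markdownDocument/*/sections/h1" },
        { "name": "header_2", "source": "/document/markdownDocument/*/sections/h2" },
        { "name": "header_3", "source": "/document/markdownDocument/*/sections/h3" }
      ]
    }
  ],
  "parameters": { "projectionMode": "skipIndexingParentDocuments" }
}
```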

 ```https
 POST {endpoint}/skillsets?api-version=2024-11-01-preview
@@ -298,7 +314,7 @@ POST {endpoint}/skillsets?api-version=2024-11-01-preview
 
 ```
 
-## Run the indexer
+## Configure and run the indexer
 
 Once you create a data source, index, and skillset, you're ready to [create and run the indexer](search-howto-create-indexers.md#run-the-indexer). This step puts the pipeline into execution.

@@ -307,9 +323,13 @@ When using the [Document Layout skill](cognitive-search-skill-document-intellige
 + The `allowSkillsetToReadFileData` parameter should be set to `true`.
 + The `parsingMode` parameter should be set to `default`.
 
-Here's an example payload
+`outputFieldMappings` don't need to be set in this scenario because `indexProjections` handle the source-field-to-search-field associations. Index projections handle field associations for the Document Layout skill, and also for regular chunking with the split skill in imported and vectorized data workloads. Output field mappings are still necessary for transformations or complex data mappings with functions, which apply in other cases. For n-chunks-per-document scenarios, index projections handle this functionality natively.
+
+Here's an example of an indexer creation request.
+
+```https
+POST {endpoint}/indexers?api-version=2024-11-01-preview
 
-```json
 {
   "name": "my_indexer",
   "dataSourceName": "my_blob_datasource",
@@ -333,6 +353,8 @@ Here's an example payload
 }
 ```
 
+When you send the request to the search service, the indexer runs.
+
 ## Verify results
 
 You can query your search index after processing concludes to test your solution.
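To confirm that processing has concluded before you query, you can poll indexer execution status. The following REST sketch is a hedged example that reuses the `my_indexer` name from the indexer example above; adjust the name and api-version for your service. The response reports the most recent execution, including item counts, errors, and warnings.

```http
GET {endpoint}/indexers('my_indexer')/search.status?api-version=2024-11-01-preview
```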
@@ -344,16 +366,41 @@ For Search Explorer, you can copy just the JSON and paste it into the JSON view
 ```http
 POST /indexes/[index name]/docs/search?api-version=[api-version]
 {
-  "search": "*",
-  "select": "metadata_storage_path, markdown_section, vector"
+  "search": "copay for in-network providers",
+  "count": true,
+  "searchMode": "all",
+  "vectorQueries": [
+    {
+      "kind": "text",
+      "text": "*",
+      "fields": "text_vector,image_vector"
+    }
+  ],
+  "queryType": "semantic",
+  "semanticConfiguration": "healthplan-doc-layout-test-semantic-configuration",
+  "captions": "extractive",
+  "answers": "extractive|count-3",
+  "queryLanguage": "en-us",
+  "select": "header_1, header_2, header_3"
 }
 ```
 
+If you used the health plan PDFs to test this skill, Search Explorer results for the example query should look similar to the results in the following screenshot.
+
++ The query is a [hybrid query](hybrid-search-how-to-query.md) over text and vectors, so you see a `@search.rerankerScore` and results are ranked by that score. `searchMode=all` means that *all* query terms must match (the default is *any*).
+
++ The query uses semantic ranking, so you see `captions` (it also has `answers`, but those aren't shown in the screenshot). The results are the most semantically relevant to the query input, as determined by the [semantic ranker](semantic-search-overview.md).
+
++ The `select` statement (not shown in the screenshot) specifies the header fields that the Document Layout skill detects and populates. You can add more fields to the select clause to inspect the content of chunks, title, or any other human-readable field.
+
+:::image type="content" source="media/search-how-to-semantic-chunking/query-results-doc-layout.png" lightbox="media/search-how-to-semantic-chunking/query-results-doc-layout.png" alt-text="Screenshot of hybrid query results that include doc layout skill output fields.":::
+
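Following up on the `select` guidance in the bullets above, here's a simpler keyword-only variant of the same request that pulls back chunk text for inspection. It's a sketch: it assumes your index has the `chunk` field described in the article and a human-readable `title` field (an assumption; remove it if your index doesn't have one).

```http
POST /indexes/[index name]/docs/search?api-version=[api-version]

{
  "search": "copay for in-network providers",
  "count": true,
  "top": 5,
  "select": "header_1, header_2, header_3, title, chunk"
}
```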
 ## See also
 
 + [Create or update a skill set](cognitive-search-defining-skillset.md).
 + [Create a data source](search-howto-indexing-azure-blob-storage.md)
 + [Define an index projection](search-how-to-define-index-projections.md)
++ [Attach an Azure AI multi-service account](cognitive-search-attach-cognitive-services.md)
 + [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md)
 + [Text Split skill](cognitive-search-skill-textsplit.md)
 + [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md)

articles/search/toc.yml

Lines changed: 2 additions & 2 deletions
@@ -321,8 +321,6 @@ items:
   href: cognitive-search-concept-image-scenarios.md
 - name: Cache (incremental) enrichment
   href: search-howto-incremental-index.md
-- name: Structure-aware chunking and vectorization
-  href: search-how-to-semantic-chunking.md
 - name: Design tips
   href: cognitive-search-concept-troubleshooting.md
 - name: Custom skills
@@ -339,6 +337,8 @@ items:
   href: vector-search-how-to-create-index.md
 - name: Chunk documents
   href: vector-search-how-to-chunk-documents.md
+- name: Chunk and vectorize by document layout
+  href: search-how-to-semantic-chunking.md
 - name: Generate embeddings
   href: vector-search-how-to-generate-embeddings.md
 - name: Use embedding models from Azure AI Studio
