The **Document Layout** skill analyzes a document to extract regions of interest and their inter-relationships to produce a syntactical representation of the document in Markdown format. This skill uses the [Document Intelligence layout model](/azure/ai-services/document-intelligence/concept-layout) provided in [Azure AI Document Intelligence](/azure/ai-services/document-intelligence/overview).
This article is the reference documentation for the Document Layout skill.
+ For PDF and TIFF, up to 2,000 pages can be processed (with a free tier subscription, only the first two pages are processed).
+ Even though the file size limit for analyzing documents is 500 MB for the [Azure AI Document Intelligence paid (S0) tier](https://azure.microsoft.com/pricing/details/cognitive-services/) and 4 MB for the [Azure AI Document Intelligence free (F0) tier](https://azure.microsoft.com/pricing/details/cognitive-services/), indexing is subject to the [indexer limits](search-limits-quotas-capacity.md#indexer-limits) of your search service tier.
+ Image dimensions must be between 50 x 50 pixels and 10,000 x 10,000 pixels.
+ If your PDFs are password-locked, remove the lock before running the indexer.

|`outputMode`|`oneToMany`| Controls the cardinality of the output produced by the skill. |
|`markdownHeaderDepth`|`h1`, `h2`, `h3`, `h4`, `h5`, `h6` (default)| This parameter describes the deepest heading level that's treated as a section boundary. For instance, if `markdownHeaderDepth` is set to `h3`, any Markdown section that's deeper than h3 (that is, #### and deeper) is considered as "content" that's added to whatever level its parent is at. |
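For instance, with `markdownHeaderDepth` set to `h3`, a `####` heading doesn't produce its own section. Its text is folded into the `content` of the enclosing h3 section, as in this illustrative sketch of a single section object (values are hypothetical):

```json
{
  "content": "#### Advanced options\r\nText under the h4 heading is carried in the h3 section's content.",
  "sections": {
    "h1": "User guide",
    "h2": "Settings",
    "h3": "Options"
  }
}
```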
## Skill inputs
For example, a file reference object that points to the file by URL can be defined as follows (the URL and SAS token are placeholder values):

```json
{
  "$type": "file",
  "url": "https://example.blob.core.windows.net/mycontainer/myfile.pdf",
  "sastoken": "?sv=..."
}
```

The file reference object can be generated in one of the following ways:
+ Setting the `allowSkillsetToReadFileData` parameter on your indexer definition to `true`. This setting creates a path `/document/file_data` that's an object representing the original file data downloaded from your blob data source. This parameter only applies to files in Azure Blob storage.
+ Having a custom skill return a JSON object that provides `$type` and either `data` or `url` with an optional `sastoken`, as in the sketch that follows this list. The `$type` parameter must be set to `file`, and `data` must be the Base64-encoded byte array of the file content. The `url` parameter must be a valid URL with access for downloading the file at that location.
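For example, a custom skill Web API response that passes file content by value might look like the following sketch (the record ID and Base64 payload are placeholder values):

```json
{
  "values": [
    {
      "recordId": "1",
      "data": {
        "file_data": {
          "$type": "file",
          "data": "aGVsbG8gd29ybGQ="
        }
      }
    }
  ]
}
```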
## Skill outputs
| Output name | Description |
|---------------|-------------------------------|
|`markdown_document`| A collection of "sections" objects, which represent each individual section in the Markdown document.|
## Sample definition
A definition for this skill might look like the following (a reconstructed sketch; the `markdownHeaderDepth` value and output `targetName` are illustrative):

```json
{
  "skills": [
    {
      "description": "Analyze a document",
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "context": "/document",
      "outputMode": "oneToMany",
      "markdownHeaderDepth": "h3",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "markdown_document",
          "targetName": "markdown_document"
        }
      ]
    }
  ]
}
```
## Sample output
The following sketch shows the shape of the output for a source document with h1, h2, and h3 headings (the content values are illustrative):

```json
{
  "markdown_document": [
    {
      "content": "Hi this is Jim \r\nHi this is Joe",
      "sections": {
        "h1": "Foo",
        "h2": "Bar",
        "h3": ""
      }
    },
    {
      "content": "Hi this is Lance",
      "sections": {
        "h1": "Foo",
        "h2": "Bar",
        "h3": "Boo"
      }
    }
  ]
}
```

The value of `markdownHeaderDepth` controls the number of keys in the "sections" dictionary. Because the example skill definition specifies `markdownHeaderDepth` as "h3", there are three keys in the "sections" dictionary: h1, h2, and h3.

articles/search/search-how-to-semantic-chunking.md

---
title: Structure-aware chunking and vectorization
titleSuffix: Azure AI Search
description: Chunk text content by paragraph or semantically coherent fragment. You can then apply integrated vectorization to generate embeddings and send the results to a searchable index.
author: rawan
ms.author: rawan
ms.service: azure-ai-search
ms.topic: how-to
ms.date: 11/19/2024
ms.custom:
  - references_regions
---

# Structure-aware chunking and vectorization in Azure AI Search

Text data chunking strategies play a key role in optimizing RAG responses and performance. By using the new Document Layout skill that's currently in preview, you can chunk content based on paragraphs or semantically coherent fragments of sentence representations. These fragments can then be processed independently and recombined as semantic representations without loss of information, interpretation, or semantic relevance. The inherent meaning of the text is used as a guide for the chunking process.

The Document Layout skill uses Markdown syntax (headings and content) to articulate document structure in the search document. The searchable content obtained from your source document is plain text, but you can add integrated vectorization to generate embeddings for any field.

In this article, learn how to:

> [!div class="checklist"]
> + Use the Document Layout skill to detect sections and output Markdown content
> + Use the Text Split skill to constrain chunk size to each Markdown section
> + Generate embeddings for each chunk
> + Use index projections to map embeddings to fields in a search index

## Prerequisites

+ [An indexer-based indexing pipeline](search-indexer-overview.md) with an index that accepts the output.
+ [A supported data source](search-indexer-overview.md#supported-data-sources) with text content that you want to chunk.
+ [A skillset with the Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) that splits documents based on paragraph boundaries.
+ [An Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md) that generates vector embeddings.
+ [An index projection](search-how-to-define-index-projections.md) for one-to-many indexing.
## Prepare data files
The raw inputs must be in a [supported data source](search-indexer-overview.md#supported-data-sources), and the files must be in a format that the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) supports.

+ Supported indexers include any indexer that can handle the supported file formats, such as [Blob indexers](search-howto-indexing-azure-blob-storage.md), [OneLake indexers](search-how-to-index-onelake-files.md), and [File indexers](search-file-storage-integration.md).
+ Supported regions for this feature include East US, West US 2, West Europe, and North Central US. Be sure to [check this list](search-region-support.md#azure-public-regions) for updates on regional availability.
You can use the Azure portal, REST APIs, or an Azure SDK package to [create a data source](search-howto-indexing-azure-blob-storage.md).
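For example, a data source definition for Azure Blob Storage might look like the following sketch (the data source name, connection string, and container name are placeholders):

```http
POST {endpoint}/datasources?api-version=2024-11-01-preview
{
  "name": "my-blob-datasource",
  "type": "azureblob",
  "credentials": {
    "connectionString": "<your storage connection string>"
  },
  "container": {
    "name": "<your container name>"
  }
}
```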
> [!TIP]
> Upload the [health plan PDF](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/health-plan) sample files to your supported data source to try out the Document Layout skill and structure-aware chunking on your own search service. The [Import and vectorize data](search-get-started-portal-import-vectors.md) wizard is an easy code-free approach for trying out this skill. Be sure to select the **default parsing mode** to use structure-aware chunking. Otherwise, the [Markdown parsing mode](search-how-to-index-markdown-blobs.md) is used instead.
## Create an index for one-to-many indexing
Here's an example payload of a single search document designed around chunks. In this example, the parent field is `text_parent_id`. The child fields are the vector and nonvector chunks of the Markdown section.
You can use the Azure portal, REST APIs, or an Azure SDK to [create an index](search-how-to-load-search-index.md).
An index must exist on the search service before you create the skill set or run the indexer.
The following is a condensed sketch of such an index (field names like `text_parent_id`, `chunk`, and `text_vector` are illustrative, and the `vectorSearch` configuration must match your embedding model's dimensions):

```json
{
  "name": "my_consolidated_index",
  "fields": [
    { "name": "chunk_id", "type": "Edm.String", "key": true, "searchable": true, "analyzer": "keyword" },
    { "name": "text_parent_id", "type": "Edm.String", "filterable": true },
    { "name": "chunk", "type": "Edm.String", "searchable": true },
    { "name": "header_1", "type": "Edm.String", "searchable": true },
    { "name": "header_2", "type": "Edm.String", "searchable": true },
    { "name": "header_3", "type": "Edm.String", "searchable": true },
    { "name": "text_vector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "my-profile" }
  ],
  "vectorSearch": {
    "profiles": [ { "name": "my-profile", "algorithm": "my-hnsw" } ],
    "algorithms": [ { "name": "my-hnsw", "kind": "hnsw" } ]
  }
}
```

## Define skill set for structure-aware chunking and vectorization
Because the Document Layout skill is in preview, you must use the [Create Skillset 2024-11-01-preview](/rest/api/searchservice/skillsets/create?view=rest-searchservice-2024-11-01-preview&preserve-view=true) REST API for this step.
Here's an example skill set definition payload that projects individual Markdown section chunks and their vector outputs as documents in the search index, using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md) and the [Azure OpenAI Embedding skill](cognitive-search-skill-azure-openai-embedding.md). The skills array and index projection shown are a condensed sketch, and the Azure OpenAI resource URI, deployment, and model names are placeholders:
```http
POST {endpoint}/skillsets?api-version=2024-11-01-preview
{
  "name": "my_skillset",
  "description": "A skillset for structure-aware chunking and vectorization with an index projection around markdown section",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "context": "/document",
      "outputMode": "oneToMany",
      "markdownHeaderDepth": "h3",
      "inputs": [
        { "name": "file_data", "source": "/document/file_data" }
      ],
      "outputs": [
        { "name": "markdown_document", "targetName": "markdownDocument" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "context": "/document/markdownDocument/*",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "inputs": [
        { "name": "text", "source": "/document/markdownDocument/*/content" }
      ],
      "outputs": [
        { "name": "textItems", "targetName": "pages" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "context": "/document/markdownDocument/*/pages/*",
      "resourceUri": "https://<your-azure-openai-resource>.openai.azure.com",
      "deploymentId": "<your-embedding-deployment>",
      "modelName": "text-embedding-ada-002",
      "inputs": [
        { "name": "text", "source": "/document/markdownDocument/*/pages/*" }
      ],
      "outputs": [
        { "name": "embedding", "targetName": "text_vector" }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "my_consolidated_index",
        "parentKeyFieldName": "text_parent_id",
        "sourceContext": "/document/markdownDocument/*/pages/*",
        "mappings": [
          { "name": "chunk", "source": "/document/markdownDocument/*/pages/*" },
          { "name": "header_1", "source": "/document/markdownDocument/*/sections/h1" },
          { "name": "header_2", "source": "/document/markdownDocument/*/sections/h2" },
          { "name": "header_3", "source": "/document/markdownDocument/*/sections/h3" },
          { "name": "text_vector", "source": "/document/markdownDocument/*/pages/*/text_vector" }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
```
## Run the indexer
Once you create a data source, index, and skillset, you're ready to [create and run the indexer](search-howto-create-indexers.md#run-the-indexer). This step puts the pipeline into execution.
When using the [Document Layout skill](cognitive-search-skill-document-intelligence-layout.md), make sure to set the following parameters on the indexer definition:
+ The `allowSkillsetToReadFileData` parameter should be set to `true`.
+ The `parsingMode` parameter should be set to `default`.
Here's an example payload (a minimal sketch; the data source, index, and skillset names carry over from the earlier examples):

```json
{
  "name": "my_indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my_consolidated_index",
  "skillsetName": "my_skillset",
  "parameters": {
    "configuration": {
      "parsingMode": "default",
      "allowSkillsetToReadFileData": true
    }
  }
}
```

## Verify results
You can query your search index after processing concludes to test your solution.
To check the results, run a query against the index. Use [Search Explorer](search-explorer.md) as a search client, or any tool that sends HTTP requests. The following query selects fields that contain the nonvector content of the Markdown section and its vector; the field names match the index sketch shown earlier.
For Search Explorer, you can copy just the JSON and paste it into the JSON view for query execution.
```http
POST /indexes/[index name]/docs/search?api-version=[api-version]
{
  "search": "*",
  "select": "header_1, header_2, header_3, chunk, text_vector",
  "count": true
}
```

## See also

+ [Create or update a skill set](cognitive-search-defining-skillset.md)
+ [Create a data source](search-howto-indexing-azure-blob-storage.md)
+ [Define an index projection](search-how-to-define-index-projections.md)