
Commit 05b947f

Merge pull request #5233 from MicrosoftDocs/main
Publish to live, Wednesday 4AM PST, 5/28
2 parents 51eec00 + 3043676 commit 05b947f

9 files changed (+92 lines, -53 lines)

articles/search/cognitive-search-concept-annotations-syntax.md

Lines changed: 10 additions & 10 deletions
@@ -8,24 +8,24 @@ ms.service: azure-ai-search
ms.custom:
  - ignite-2023
ms.topic: how-to
- ms.date: 12/10/2024
+ ms.date: 05/27/2025
---

# Reference a path to enriched nodes using context and source properties in an Azure AI Search skillset

During skillset execution, the engine builds an in-memory [enrichment tree](cognitive-search-working-with-skillsets.md#enrichment-tree) that captures each enrichment, such as recognized entities or translated text. In this article, learn how to reference an enrichment node in the enrichment tree so that you can pass output to downstream skills or specify an output field mapping for a search index field.

- This article uses examples to illustrate various scenarios. For the full syntax, see [Skill context and input annotation language language](cognitive-search-skill-annotation-language.md).
+ This article uses examples to illustrate various scenarios. For the full syntax, see [Skill context and input annotation language](cognitive-search-skill-annotation-language.md).

## Background concepts

Before reviewing the syntax, let's revisit a few important concepts to better understand the examples provided later in this article.

| Term | Description |
|------|-------------|
- | "enriched document" | An enriched document is an in-memory structure that collects skill output as it's created and it holds all enrichments related to a document. Think of an enriched document as a tree. Generally, the tree starts at the root document level, and each new enrichment is created from a previous as its child. |
- | "node" | Within an enriched document, a node (sometimes referred to as an "annotation") is created and populated by a skill, such as "text" and "layoutText" in the OCR skill. An enriched document is populated with both enrichments and original source field values or metadata copied from the source. |
- | "context" | The scope of enrichment, which is either the entire document, a portion of a document, or if you're working with images, the extracted images from a document. By default, the enrichment context is at the `"/document"` level, scoped to individual documents contained in the data source. When a skill runs, the outputs of that skill become [properties of the defined context](#example-2). |
+ | "enriched document" | An enriched document is an in-memory structure that collects skill output as it's created and it holds all enrichments related to a document. Think of an enriched document as a tree. Generally, the tree starts at the root document level, and each new enrichment is created from a previous node as its child. |
+ | "node" | Within an enriched document, a node (sometimes referred to as an "annotation") is specific output such as the "text" or "layoutText" of the OCR skill, or an original source field value such as the content of a product ID field, or metadata copied from the source such as metadata_storage_path from blobs in Azure Storage. |
+ | "context" | The scope of enrichment, which is either the entire document, a portion of a document (pages or sentences), or if you're working with images, the extracted images from a document. By default, the enrichment context is at the `"/document"` level, scoped to individual documents contained in the data source. When a skill runs, the outputs of that skill become [properties of the defined context](#example-2). |

## Paths for different scenarios

@@ -37,7 +37,7 @@ The example in the screenshot illustrates the path for an item in an Azure Cosmo

+ `context` path is `/document/HotelId` because the collection is partitioned into documents by the `/HotelId` field.

- + `source` path is `/document/Description` because the skill is a translation skill, and the field that you'll want the skill to translate is the `Description` field in each document.
+ + `source` path is `/document/Description` because the skill is a translation skill, and the field that you want to translate is the `Description` field in each document.

All paths start with `/document`. An enriched document is created in the "document cracking" stage of indexer execution, when the indexer opens a document or reads in a row from the data source. Initially, the only node in an enriched document is the [root node (`/document`)](cognitive-search-skill-annotation-language.md#document-root), and it's the node from which all other enrichments occur.

@@ -47,7 +47,7 @@ The following list includes several common examples:
+ `/document/{key}` is the syntax for a document or item in an Azure Cosmos DB collection, where `{key}` is the actual key, such as `/document/HotelId` in the previous example.
+ `/document/content` specifies the "content" property of a JSON blob.
+ `/document/{field}` is the syntax for an operation performed on a specific field, such as translating the `/document/Description` field, seen in the previous example.
- + `/document/pages/*` or `/document/sentences/*` become the context if you're breaking a large document into smaller chunks for processing. If "context" is `/document/pages/*`, the skill executes once over each page in the document. Because there might be more than one page or sentence, you'll append `/*` to catch them all.
+ + `/document/pages/*` or `/document/sentences/*` become the context if you're breaking a large document into smaller chunks for processing. If "context" is `/document/pages/*`, the skill executes once over each page in the document. Because there might be more than one page or sentence, you can append `/*` to catch them all.
+ `/document/normalized_images/*` is created during document cracking if the document contains images. All paths to images start with normalized_images. Since there are often multiple images embedded in a document, append `/*`.

Examples in the remainder of this article are based on the "content" field generated automatically by [Azure blob indexers](search-howto-indexing-azure-blob-storage.md) as part of the [document cracking](search-indexer-overview.md#document-cracking) phase. When referring to documents from a Blob container, use a format such as `"/document/content"`, where the "content" field is part of the "document".
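To make the `/document/pages/*` context in the list above concrete, here's a minimal sketch of a Text Split skill that chunks `/document/content` into pages; the `maximumPageLength` value and the `pages` target name are illustrative choices, not requirements. A downstream skill can then set `"context": "/document/pages/*"` to run once per chunk.

```json
{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "description": "Chunk /document/content so that downstream skills can use /document/pages/* as their context",
  "context": "/document",
  "textSplitMode": "pages",
  "maximumPageLength": 2000,
  "inputs": [
    { "name": "text", "source": "/document/content" }
  ],
  "outputs": [
    { "name": "textItems", "targetName": "pages" }
  ]
}
```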
@@ -56,7 +56,7 @@ Examples in the remainder of this article are based on the "content" field gener

## Example 1: Simple annotation reference

- In Azure Blob Storage, suppose you have a variety of files containing references to people's names that you want to extract using entity recognition. In the following skill definition, `"/document/content"` is the textual representation of the entire document, and "people" is an extraction of full names for entities identified as persons.
+ In Azure Blob Storage, suppose you have various files containing references to people's names that you want to extract using entity recognition. In the following skill definition, `"/document/content"` is the textual representation of the entire document, and "people" is an extraction of full names for entities identified as persons.

Because the default context is `"/document"`, the list of people can now be referenced as `"/document/people"`. In this specific case `"/document/people"` is an annotation, which could now be mapped to a field in an index, or used in another skill in the same skillset.
6262

@@ -110,15 +110,15 @@ To invoke the right number of iterations, set the context as `"/document/people/
}
```

- When annotations are arrays or collections of strings, you might want to target specific members rather than the array as a whole. The above example generates an annotation called `"last"` under each node represented by the context. If you want to refer to this family of annotations, you could use the syntax `"/document/people/*/last"`. If you want to refer to a particular annotation, you could use an explicit index: `"/document/people/1/last`" to reference the last name of the first person identified in the document. Notice that in this syntax arrays are "0 indexed".
+ When annotations are arrays or collections of strings, you might want to target specific members rather than the array as a whole. The previous example generates an annotation called `"last"` under each node represented by the context. If you want to refer to this family of annotations, you could use the syntax `"/document/people/*/last"`. If you want to refer to a particular annotation, you could use an explicit index: `"/document/people/1/last"` to reference the last name of the second person identified in the document. Notice that in this syntax, arrays are zero-indexed.
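As a usage sketch, the same family path can also drive an output field mapping in the indexer definition; `lastNames` is a hypothetical string collection field in the target index.

```json
"outputFieldMappings": [
  {
    "sourceFieldName": "/document/people/*/last",
    "targetFieldName": "lastNames"
  }
]
```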

<a name="example-3"></a>

## Example 3: Reference members within an array

Sometimes you need to group all annotations of a particular type to pass them to a specific skill. Consider a hypothetical custom skill that identifies the most common last name from all the last names extracted in Example 2. To provide just the last names to the custom skill, specify the context as `"/document"` and the input as `"/document/people/*/lastname"`.

- Notice that the cardinality of `"/document/people/*/lastname"` is larger than that of document. There may be 10 lastname nodes while there's only one document node for this document. In that case, the system will automatically create an array of `"/document/people/*/lastname"` containing all of the elements in the document.
+ Notice that the cardinality of `"/document/people/*/lastname"` is larger than that of document. There might be 10 lastname nodes while there's only one document node for this document. In that case, the system automatically creates an array of `"/document/people/*/lastname"` containing all of the elements in the document.

```json
{

articles/search/cognitive-search-skill-document-extraction.md

Lines changed: 11 additions & 3 deletions
@@ -10,15 +10,23 @@ ms.service: azure-ai-search
ms.custom:
  - ignite-2023
ms.topic: reference
- ms.date: 12/12/2021
+ ms.date: 05/27/2025
---
+
# Document Extraction cognitive skill

- The **Document Extraction** skill extracts content from a file within the enrichment pipeline. This allows you to take advantage of the document extraction step that normally happens before the skillset execution with files that may be generated by other skills.
+ The **Document Extraction** skill extracts content from a file within the enrichment pipeline. By default, content extraction or retrieval is built into the indexer pipeline. However, by using the Document Extraction skill, you can control how parameters are set, and how extracted content is named in the enrichment tree.
+
+ For [vector](vector-search-overview.md) and [multimodal search](multimodal-search-overview.md), Document Extraction combined with the [Text Split skill](cognitive-search-skill-textsplit.md) is more affordable than other [data chunking approaches](vector-search-how-to-chunk-documents.md). The following tutorials demonstrate skill usage for different scenarios:
+
+ + [Tutorial: Index mixed content using multimodal embeddings and the Document Extraction skill](tutorial-multimodal-indexing-with-embedding-and-doc-extraction.md)
+
+ + [Tutorial: Index mixed content using image verbalizations and the Document Extraction skill](tutorial-multimodal-indexing-with-image-verbalization-and-doc-extraction.md)

> [!NOTE]
> This skill isn't bound to Azure AI services and has no Azure AI services key requirement.
- > This skill extracts text and images. Text extraction is free. Image extraction is [metered by Azure AI Search](https://azure.microsoft.com/pricing/details/search/). On a free search service, the cost of 20 transactions per indexer per day is absorbed so that you can complete quickstarts, tutorials, and small projects at no charge. For Basic, Standard, and above, image extraction is billable.
+ >
+ > This skill extracts text and images. Text extraction is free. Image extraction is [billable by Azure AI Search](https://azure.microsoft.com/pricing/details/search/). On a free search service, the cost of 20 transactions per indexer per day is absorbed so that you can complete quickstarts, tutorials, and small projects at no charge. For basic and higher tiers, image extraction is billable.
>
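For orientation before the reference details that follow, here's a minimal sketch of a Document Extraction skill definition; the parameter values and target names are illustrative defaults rather than requirements.

```json
{
  "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
  "description": "Extract content and images from the file data supplied by the indexer",
  "context": "/document",
  "parsingMode": "default",
  "dataToExtract": "contentAndMetadata",
  "configuration": {
    "imageAction": "generateNormalizedImages"
  },
  "inputs": [
    { "name": "file_data", "source": "/document/file_data" }
  ],
  "outputs": [
    { "name": "content", "targetName": "extracted_content" },
    { "name": "normalized_images", "targetName": "extracted_normalized_images" }
  ]
}
```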
## @odata.type
