Commit f3a14a1

Merge branch 'release-preview-2-cu' into kate-4466-pro-standard-modes-only
2 parents 049bed5 + b7e20f7 commit f3a14a1

27 files changed

+1106
-586
lines changed
File renamed without changes.

articles/ai-services/content-understanding/concepts/retrieval-augmented-generation.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@ title: Azure AI Content Understanding retrieval-augmented generation
33
titleSuffix: Azure AI services
44
description: Learn about using Content Understanding and retrieval-augmented generation
55
author: laujan
6-
ms.author: tonyeiyalla
6+
ms.author: paulhsu
77
manager: nitinme
88
ms.service: azure-ai-content-understanding
99
ms.topic: overview
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
1+
---
2+
title: "Document analysis: extracting structured content with Azure AI Content Understanding"
3+
titleSuffix: Azure AI services
4+
description: Learn about Azure AI Content Understanding's document layout analysis and data extraction capabilities
5+
author: laujan
6+
ms.author: paulhsu
7+
manager: nitinme
8+
ms.service: azure-ai-content-understanding
9+
ms.topic: overview
10+
ms.date: 05/19/2025
11+
---
12+
13+
# Document analysis: extracting structured content
14+
15+
> [!IMPORTANT]
16+
>
17+
> * Azure AI Content Understanding is available in preview. Public preview releases provide early access to features that are in active development.
18+
> * Features, approaches, and processes can change or have limited capabilities, before General Availability (GA).
19+
> * For more information, *see* [**Supplemental Terms of Use for Microsoft Azure Previews**](https://azure.microsoft.com/support/legal/preview-supplemental-terms).
20+
21+
## Overview
22+
23+
Azure AI Content Understanding's document analysis capabilities help you transform unstructured document data into structured, machine-readable information. By precisely identifying and extracting document elements while preserving their structural relationships, you can build powerful document processing workflows for a wide range of applications.
24+
25+
This article explains the document analysis features that enable you to extract meaningful content from your documents, preserve document structures, and unlock the full potential of your document data.
26+
27+
## Document elements
28+
29+
The following document elements can be extracted through content extraction:
30+
31+
* [**Markdown**](#markdown-content-elements)
32+
* Content elements
33+
* [**Words**](#words)
34+
* [**Selection marks**](#selection-marks)
35+
* [**Barcodes**](#barcodes)
36+
* [**Formulas**](#formulas)
37+
* [**Images**](#images)
38+
* Layout elements
39+
* [**Pages**](#pages)
40+
* [**Paragraphs**](#paragraphs)
41+
* [**Lines**](#lines)
42+
* [**Tables**](#tables)
43+
* [**Sections**](#sections)
44+
45+
> [!NOTE]
46+
> Not all content and layout elements are applicable or currently supported by all document file types.
47+
48+
### Markdown content elements
49+
50+
Content Understanding generates richly formatted markdown that preserves the original document's structure, enabling large language models to better comprehend document context and hierarchical relationships for AI-powered analysis and generation tasks. In addition to words, selection marks, barcodes, formulas, and images as content, the markdown also includes sections, tables, and page metadata for both visual rendering and machine processing. Learn more about how Content Understanding represents [content and layout elements in markdown](markdown.md).
51+
52+
#### Words
53+
54+
A `word` is a content element composed of a sequence of characters. Content Understanding uses word boundaries defined by [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/#Word_Boundaries). For Latin languages, words may be split from punctuation even without intervening spaces. In some languages, such as Chinese, supplemental word dictionaries are used to enable word breaking at semantic boundaries. For more information, *see* [Boundary Analysis](https://unicode-org.github.io/icu/userguide/boundaryanalysis/).
55+
56+
57+
:::image type="content" source="../media/document/word-boundaries.png" alt-text="Screenshot of detected words.":::
58+
59+
#### Selection marks
60+
61+
A `selection mark` is a content element that represents a visual glyph indicating the state of a selection. Selection marks can appear as check boxes, check marks, radio buttons, and similar glyphs. The state of a selection mark can be selected or unselected, with different visual representations to indicate each state. They're encoded as words in the document analysis result using `☒` (selected) and `☐` (unselected).
62+
63+
Content Understanding detects check marks inside table cells as selection marks in the selected state. However, it doesn't detect empty table cells as selection marks in the unselected state.
64+
65+
:::image type="content" source="../media/document/selection-marks.png" alt-text="Screenshot of detected selection marks.":::
66+
67+
#### Barcodes
68+
69+
A `barcode` is a content element that describes both linear (for example, UPC, EAN) and 2D (for example, QR, MaxiCode) barcodes. Content Understanding represents each barcode using its detected type and extracted value. The following barcode formats are currently accepted:
70+
71+
72+
* `QR Code`
73+
* `Code 39`
74+
* `Code 93`
75+
* `Code 128`
76+
* `UPC (UPC-A & UPC-E)`
77+
* `PDF417`
78+
* `EAN-8`
79+
* `EAN-13`
80+
* `Codabar`
81+
* `Databar`
82+
* `Databar (expanded)`
83+
* `ITF`
84+
* `Data Matrix`
85+
86+
#### Formulas
87+
88+
A `formula` is a content element representing a mathematical expression in the document. It may be an `inline` formula embedded with other text, or a `display` formula that takes up an entire line. Multiline formulas are represented as multiple `display` formula elements grouped into `paragraphs` to preserve mathematical relationships.
89+
90+
#### Images
91+
92+
An `image` is a content element that represents an embedded image, figure, or chart in the document. Content Understanding extracts any embedded text from the images, and any associated captions and footnotes.
93+
94+
### Layout elements
95+
96+
Document layout elements are visual and structural components, such as pages, paragraphs, lines, tables, sections, and overall structure, that help interpret content. Extracting these elements enables tools to analyze documents efficiently for tasks like information retrieval, semantic understanding, and data structuring.
97+
98+
#### Pages
99+
100+
A `page` is a grouping of content that typically corresponds to one side of a sheet of paper. A rendered page is characterized via `width` and `height` in the specified `unit`. In general, images use pixels while PDFs use inches. The `angle` property describes the overall text angle in degrees for pages that may be rotated.
101+
102+
> [!NOTE]
103+
> For spreadsheets like Excel, each sheet is mapped to a page. For presentations, like PowerPoint, each slide is mapped to a page. For file formats like HTML or Word documents, which lack a native page concept without rendering, the entire main content is treated as a single page.
104+
105+
#### Paragraphs
106+
107+
A `paragraph` is an ordered sequence of lines that form a logical unit. Typically, the lines share common alignment and spacing between lines. Paragraphs are often delimited via indentation, added spacing, or bullets/numbering. Some paragraphs may have a special functional `role` in the document. Currently supported roles include page header, page footer, page number, title, section heading, footnote, and formula block.
108+
109+
#### Lines
110+
111+
A `line` is an ordered sequence of consecutive content elements, often separated by visual spaces. Content elements in the same horizontal plane (row) but separated by more than a single visual space are most often split into multiple lines. While this feature sometimes splits semantically contiguous content into separate lines, it enables the representation of textual content split into multiple columns or cells. Lines in vertical writing are detected in the vertical direction.
112+
113+
#### Tables
114+
115+
A `table` organizes content into a group of cells in a grid layout. The rows and columns may be visually separated by grid lines, color banding, or greater spacing. The position of a table cell is specified via its row and column indices. A cell can span across multiple rows and columns.
116+
117+
Based on its position and styling, a cell can be classified as general content, row header, column header, stub head, or description:
118+
119+
* A row header cell is typically the first cell in a row that describes the other cells in the row.
120+
121+
* A column header cell is typically the first cell in a column that describes the other cells in a column.
122+
123+
* A row or column can contain multiple header cells to describe hierarchical content.
124+
125+
* A stub head cell is typically the cell in the first row and first column position. It can be empty or describe the values in the header cells in the same row/column.
126+
127+
* A description cell generally appears at the top or bottom most area of a table, describing the overall table content. However, it can sometimes appear in the middle of a table to break the table into sections. Typically, description cells span across multiple cells in a single row.
128+
129+
A table caption specifies content that explains the table. A table can further have a set of footnotes. Unlike a description cell, a caption typically lies outside the grid layout. Table footnotes annotate content inside the table, often marked with footnote symbols. They're often found below the table grid.
130+
131+
A table may span across consecutive pages of a document. In this situation, table continuations in subsequent pages generally maintain the same column count, width, and styling. They often repeat the column headers. Other than page headers, footers, and page numbers, there's generally no intervening content between the initial table and its continuations.
132+
133+
> [!NOTE]
134+
> The span for a table covers only the core content and excludes the associated caption and footnotes.
135+
136+
:::image type="content" source="../media/document/table.png" alt-text="Illustration of table using the layout feature.":::
137+
138+
#### Sections
139+
140+
A `section` is a logical grouping of related content elements that form a hierarchical structure within the document. It often starts with a section heading as the first paragraph. A section may contain subsections, creating a nested document structure that preserves semantic relationships.
141+
142+
### Element properties
143+
144+
Documents consist of various components that can be categorized into structural, textual, and form-related elements. These elements not only define the organization and presentation of the document but can also be systematically identified and extracted for further analysis or application.
145+
146+
#### Spans
147+
148+
The `span` property specifies the logical position of the element in the document via the character offset and length into the top-level `markdown` string property. By default, character offsets and lengths are returned in Unicode code points, as used by Python 3. To accommodate development environments that use different character units, you can specify the `stringEncoding` query parameter to return span offsets and lengths in UTF-16 code units (Java, JavaScript, .NET) or UTF-8 bytes (Go, Rust, Ruby, PHP).
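
The following Python sketch illustrates why the character unit matters. It slices a markdown string by a code-point span and converts the same span into UTF-16 code units and UTF-8 bytes. The function names and the sample string are hypothetical and aren't part of the Content Understanding API; the sketch only assumes that a span is an offset and a length into the `markdown` string.

```python
def slice_by_code_points(markdown: str, offset: int, length: int) -> str:
    """Return the text covered by a span expressed in Unicode code points."""
    # Python 3 strings are indexed by code point, so plain slicing works.
    return markdown[offset:offset + length]


def to_utf16_span(markdown: str, offset: int, length: int) -> tuple:
    """Convert a code-point span into a (offset, length) pair in UTF-16 code units."""
    def units(text: str) -> int:
        # Each UTF-16 code unit is two bytes in the little-endian encoding.
        return len(text.encode("utf-16-le")) // 2

    return units(markdown[:offset]), units(markdown[offset:offset + length])


def to_utf8_span(markdown: str, offset: int, length: int) -> tuple:
    """Convert a code-point span into a (offset, length) pair in UTF-8 bytes."""
    prefix = markdown[:offset].encode("utf-8")
    covered = markdown[offset:offset + length].encode("utf-8")
    return len(prefix), len(covered)


# Hypothetical markdown content containing a character outside the
# Basic Multilingual Plane, which shifts UTF-16 and UTF-8 offsets.
sample_markdown = "Total 😀 due: 42"
word = slice_by_code_points(sample_markdown, 8, 3)
utf16_span = to_utf16_span(sample_markdown, 8, 3)
utf8_span = to_utf8_span(sample_markdown, 8, 3)
```

In this sample, the emoji counts as one code point, two UTF-16 code units, and four UTF-8 bytes, so the same word starts at a different offset in each unit.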
149+
150+
#### Source
151+
152+
The `source` property describes the visual position of the element in the file using an encoded string. For documents, the source string may be in one of the following formats:
153+
* Bounding polygon: `D({pageNumber},{x1},{y1},{x2},{y2},{x3},{y3},{x4},{y4})`
154+
* Axis-aligned bounding box: `D({pageNumber},{left},{top},{width},{height})`
155+
156+
Page numbers are `1-indexed`. The bounding polygon describes a sequence of points, clockwise from the left relative to the natural orientation of the element. For quadrilaterals, the points represent the top-left, top-right, bottom-right, and bottom-left corners. Each point represents the **x**, **y** coordinate in the length unit specified by the `unit` property. In general, the unit of measure for images is pixels while PDFs use inches.
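
As an illustration, the two source formats described here could be parsed with a small helper like the following Python sketch. The `DocumentSource` class and `parse_source` function are hypothetical names, not part of any SDK, and the sketch assumes a source string contains a single `D(...)` region.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

_SOURCE_PATTERN = re.compile(r"^D\(([^)]*)\)$")


@dataclass
class DocumentSource:
    page_number: int                     # 1-indexed page number
    polygon: List[Tuple[float, float]]   # points, clockwise from top-left


def parse_source(source: str) -> DocumentSource:
    """Parse a document source string into a page number and polygon points."""
    match = _SOURCE_PATTERN.match(source.strip())
    if match is None:
        raise ValueError(f"Unrecognized source string: {source!r}")

    parts = [part.strip() for part in match.group(1).split(",")]
    page_number = int(parts[0])
    values = [float(part) for part in parts[1:]]

    if len(values) == 4:
        # Axis-aligned bounding box: left, top, width, height.
        left, top, width, height = values
        polygon = [
            (left, top),
            (left + width, top),
            (left + width, top + height),
            (left, top + height),
        ]
    elif len(values) >= 6 and len(values) % 2 == 0:
        # Bounding polygon: alternating x and y coordinates.
        polygon = list(zip(values[0::2], values[1::2]))
    else:
        raise ValueError(f"Unexpected number of coordinates in {source!r}")

    return DocumentSource(page_number, polygon)


quad = parse_source("D(1,0.5,1.0,2.5,1.0,2.5,1.5,0.5,1.5)")
box = parse_source("D(2,10,20,30,40)")
```

Normalizing both formats to a list of points keeps downstream drawing or hit-testing code independent of which format was returned.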
157+
158+
:::image type="content" source="../media/document/bounding-regions.png" alt-text="Screenshot of detected bounding regions.":::
159+
160+
> [!NOTE]
161+
> Currently, Content Understanding only returns `4-point` quadrilaterals as bounding polygons. Future versions may return a different number of points to describe more complex shapes, such as curved lines or nonrectangular images. Currently, `source` is only returned for elements from rendered files (PDF or image).
162+
163+
## Supported content and layout elements
164+
165+
Different file formats support different subsets of content and layout elements. The following table lists the currently supported file formats for each document type.
166+
167+
|Document type|Supported format|
168+
|-----|-----|
169+
|**Portable Document Format**|`.pdf`|
170+
|**Image**|`.jpeg/.jpg`, `.png`, `.bmp`, `.tiff`, `.heif`|
171+
|**Microsoft Office**|`.docx`, `.pptx`, `.xls`|
172+
173+
## Next steps
174+
175+
* Try processing your document content using Content Understanding in [Azure AI Foundry](https://aka.ms/cu-landing).
176+
* Learn to analyze document content with [**analyzer templates**](../quickstart/use-ai-foundry.md).
177+
* Review code samples: [**visual document search**](https://github.com/Azure-Samples/azure-ai-search-with-content-understanding-python/blob/main/notebooks/search_with_visual_document.ipynb).
178+
* Review code sample: [**analyzer templates**](https://github.com/Azure-Samples/azure-ai-content-understanding-python/tree/main/analyzer_templates).
Lines changed: 129 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,129 @@
1+
---
2+
title: Content Understanding document modality supported markdown elements
3+
titleSuffix: Azure AI services
4+
description: Description of the supported Markdown elements returned as part of the Content Understanding Document response and how to use the response in your applications.
5+
author: laujan
6+
manager: nitinme
7+
ms.service: azure-ai-content-understanding
8+
ms.topic: conceptual
9+
ms.date: 05/19/2025
10+
ms.author: paulhsu
11+
12+
---
13+
14+
# Document analysis: Markdown representation
15+
16+
Azure AI Content Understanding converts unstructured documents into [GitHub Flavored Markdown](https://github.github.com/gfm), while maintaining content and layout for accurate downstream use. This document describes how each content and layout element is represented in markdown.
17+
18+
## Words and selection marks
19+
20+
Recognized words and detected selection marks are represented in markdown as plain text. Content may be escaped to avoid ambiguity with markdown formatting syntax.
21+
22+
## Barcodes
23+
24+
Barcodes are represented as markdown images with alt text and title: `![alt text](url "title")`.
25+
26+
| Content Type | Markdown Pattern | Example |
27+
| --- | --- | --- |
28+
| Barcode | `![{barcode.kind}]({barcode.path} "{barcode.value}")` | `![QRCode](barcodes/1.2 "https://www.microsoft.com")` |
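
To show how this pattern might be consumed downstream, the following Python sketch pulls barcode entries out of a markdown string with a regular expression. The `extract_barcodes` helper is hypothetical, and the sketch assumes that barcode paths always start with `barcodes/`, as in the example row, and that values contain no double quotes.

```python
import re

# Matches ![kind](barcodes/... "value"); the "barcodes/" prefix is an
# assumption taken from the example, used to skip ordinary images.
BARCODE_PATTERN = re.compile(
    r'!\[(?P<kind>[^\]]*)\]'
    r'\((?P<path>barcodes/[^\s)]+)\s+"(?P<value>[^"]*)"\)'
)


def extract_barcodes(markdown: str) -> list:
    """Return one dict per barcode with its kind, path, and value."""
    return [match.groupdict() for match in BARCODE_PATTERN.finditer(markdown)]


sample = 'Scan ![QRCode](barcodes/1.2 "https://www.microsoft.com") to continue.'
barcodes = extract_barcodes(sample)
```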
29+
30+
## Formulas
31+
32+
Mathematical formulas are encoded using LaTeX in Markdown:
33+
34+
* Inline formulas are enclosed in single dollar signs (`$...$`) to maintain text flow.
35+
* Display formulas use double dollar signs (`$$...$$`) for standalone display.
36+
* Multi-line formulas are represented as consecutive display formulas without intervening empty lines, preserving mathematical relationships.
37+
38+
| Formula Kind | Markdown | Visualization |
39+
| --- | --- | --- |
40+
| Inline | `$\sqrt { -1 } $ is $i$` | $\sqrt { -1 } $ is $i$ |
41+
| Display | `$$a^2 + b^2 = c^2$$` | $$a^2 + b^2 = c^2$$ |
42+
| Multi-line | `$$( x + 2 ) ^ 2 = x ^ 2 + 4 x + 4$$`<br/>`$$= x ( x + 4 ) + 4$$` | $$( x + 2 ) ^ 2 = x ^ 2 + 4 x + 4$$ $$= x ( x + 4 ) + 4$$ |
43+
44+
## Images
45+
46+
Detected images, including figures and charts, are currently represented using HTML `<figure>` elements in markdown that wrap the detected text in the image. Any caption is represented via a `<figcaption>` element. Any associated footnotes appear as text immediately after the figure.
47+
48+
``` md
49+
<figure>
50+
<figcaption>Figure 2: Example</figcaption>
51+
52+
Values
53+
300
54+
200
55+
100
56+
0
57+
58+
Jan Feb Mar Apr May Jun Months
59+
60+
</figure>
61+
62+
This is a footnote.
63+
```
64+
65+
## Lines and paragraphs
66+
67+
Paragraphs are represented in markdown as blocks of text separated by blank lines.
68+
When lines are available, each document line maps to a separate line in the markdown.
69+
70+
## Sections
71+
72+
Paragraphs with a title or section heading role are converted into markdown headings. The title, if any, is assigned heading level 1. The heading levels of all other sections are assigned to preserve the detected hierarchical structure.
73+
74+
## Tables
75+
76+
Tables are currently represented in markdown using HTML table markup (`<table>`, `<tr>`, `<th>`, `<td>`) to enable support for merged cells via `rowspan` and `colspan` attributes and rich headers via `<th>`. Any caption is represented via a `<caption>` element. Any associated footnotes appear as text immediately after the table.
77+
78+
:::row:::
79+
:::column:::
80+
81+
``` md
82+
<table>
83+
<caption>Table 1. Example</caption>
84+
<tr><th>Header A</th><th>Header B</th></tr>
85+
<tr><td>Cell 1A</td><td>Cell 1B</td></tr>
86+
<tr><td>Cell 2A</td><td>Cell 2B</td></tr>
87+
</table>
88+
This is a footnote.
89+
```
90+
91+
:::column-end:::
92+
:::column:::
93+
94+
```md
95+
<table>
96+
<caption>Table 1. Example</caption>
97+
<tr><th>Header A</th><th>Header B</th></tr>
98+
<tr><td>Cell 1A</td><td>Cell 1B</td></tr>
99+
<tr><td>Cell 2A</td><td>Cell 2B</td></tr>
100+
</table>
101+
This is a footnote.
102+
```
103+
:::column-end:::
104+
:::row-end:::
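
Because tables arrive as HTML inside the markdown, a downstream application needs an HTML parser rather than a markdown table parser to read them. The following Python sketch uses the standard library's `html.parser` module to collect the caption and cells, including any `rowspan` and `colspan` values. The `TableExtractor` class is a hypothetical helper and assumes well-formed, non-nested table markup like the example shown here.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the caption and cells of a single, non-nested HTML table."""

    def __init__(self):
        super().__init__()
        self.caption = ""
        self.rows = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            attributes = dict(attrs)
            self.rows[-1].append({
                "header": tag == "th",
                "rowspan": int(attributes.get("rowspan", 1)),
                "colspan": int(attributes.get("colspan", 1)),
                "text": "",
            })
            self._current = tag
        elif tag == "caption":
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "caption":
            self.caption += data
        elif self._current in ("th", "td"):
            self.rows[-1][-1]["text"] += data


table_markdown = """<table>
<caption>Table 1. Example</caption>
<tr><th>Header A</th><th>Header B</th></tr>
<tr><td>Cell 1A</td><td>Cell 1B</td></tr>
<tr><td>Cell 2A</td><td>Cell 2B</td></tr>
</table>
This is a footnote.
"""

extractor = TableExtractor()
extractor.feed(table_markdown)
header_texts = [cell["text"] for cell in extractor.rows[0]]
```

Text outside `<caption>`, `<th>`, and `<td>` elements, such as the footnote after the table, is ignored by this sketch.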
105+
106+
## Page metadata
107+
108+
Markdown doesn't natively encode page metadata, such as page numbers, headers, footers, and breaks.
109+
Since this information may be useful for downstream applications, we encode such metadata as HTML comments.
110+
111+
| Metadata | Markdown |
112+
| --- | --- |
113+
| Page number | `<!-- PageNumber="1" -->` |
114+
| Page header | `<!-- PageHeader="Header" -->` |
115+
| Page footer | `<!-- PageFooter="Footer" -->` |
116+
| Page break | `<!-- PageBreak -->` |
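
As a rough illustration of how a downstream application might use these comments, the following Python sketch splits a markdown string into pages at each page break and collects the metadata found on each page. The `split_pages` helper is hypothetical, and the sketch assumes the footer comment uses a `PageFooter` key by analogy with `PageHeader`, and that metadata values contain no double quotes.

```python
import re

PAGE_BREAK = re.compile(r"<!--\s*PageBreak\s*-->")
# Assumed comment keys; PageFooter is inferred by analogy with PageHeader.
METADATA = re.compile(r'<!--\s*(PageNumber|PageHeader|PageFooter)="(.*?)"\s*-->')


def split_pages(markdown: str) -> list:
    """Split markdown at page breaks and separate metadata from page text."""
    pages = []
    for chunk in PAGE_BREAK.split(markdown):
        metadata = {key: value for key, value in METADATA.findall(chunk)}
        text = METADATA.sub("", chunk).strip()
        pages.append({"metadata": metadata, "text": text})
    return pages


sample = (
    '<!-- PageHeader="Contoso" -->\n'
    "First page text.\n"
    '<!-- PageNumber="1" -->\n'
    "<!-- PageBreak -->\n"
    "Second page text.\n"
    '<!-- PageNumber="2" -->'
)
pages = split_pages(sample)
```

Stripping the comments from the page text keeps the metadata out of content that is later passed to search indexing or a language model, while still making it available per page.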
117+
118+
## Conclusion
119+
120+
Content Understanding's Markdown elements provide a powerful way to represent the structure and content of analyzed documents. By understanding and properly utilizing these Markdown elements, you can enhance your document processing workflows and build more sophisticated content extraction applications.
121+
122+
## Next steps
123+
124+
* Try processing your document content using Content Understanding in [Azure AI Foundry](https://aka.ms/cu-landing).
125+
* Learn to analyze document content with [**analyzer templates**](../quickstart/use-ai-foundry.md).
126+
* Review code samples: [**visual document search**](https://github.com/Azure-Samples/azure-ai-search-with-content-understanding-python/blob/main/notebooks/search_with_visual_document.ipynb).
127+
* Review code sample: [**analyzer templates**](https://github.com/Azure-Samples/azure-ai-content-understanding-python/tree/main/analyzer_templates).
128+
129+
