|
| 1 | +--- |
| 2 | +title: "Document analysis: extracting structured content with Azure AI Content Understanding" |
| 3 | +titleSuffix: Azure AI services |
| 4 | +description: Learn about Azure AI Content Understanding's document layout analysis and data extraction capabilities |
| 5 | +author: laujan |
| 6 | +ms.author: paulhsu |
| 7 | +manager: nitinme |
| 8 | +ms.service: azure-ai-content-understanding |
| 9 | +ms.topic: overview |
| 10 | +ms.date: 05/19/2025 |
| 11 | +--- |
| 12 | + |
| 13 | +# Document analysis: extracting structured content |
| 14 | + |
| 15 | +> [!IMPORTANT] |
| 16 | +> |
| 17 | +> * Azure AI Content Understanding is available in preview. Public preview releases provide early access to features that are in active development. |
| 18 | +> * Features, approaches, and processes can change or have limited capabilities before General Availability (GA). |
| 19 | +> * For more information, *see* [**Supplemental Terms of Use for Microsoft Azure Previews**](https://azure.microsoft.com/support/legal/preview-supplemental-terms). |
| 20 | +
|
| 21 | +## Overview |
| 22 | + |
| 23 | +Azure AI Content Understanding's document analysis capabilities help you transform unstructured document data into structured, machine-readable information. By precisely identifying and extracting document elements while preserving their structural relationships, you can build powerful document processing workflows for a wide range of applications. |
| 24 | + |
| 25 | +This article explains the document analysis features that enable you to extract meaningful content from your documents, preserve document structures, and unlock the full potential of your document data. |
| 26 | + |
| 27 | +## Document elements |
| 28 | + |
| 29 | +The following document elements can be extracted through content extraction: |
| 30 | + |
| 31 | +* [**Markdown**](#markdown-content-elements) |
| 32 | +* Content elements |
| 33 | + * [**Words**](#words) |
| 34 | + * [**Selection marks**](#selection-marks) |
| 35 | + * [**Barcodes**](#barcodes) |
| 36 | + * [**Formulas**](#formulas) |
| 37 | + * [**Images**](#images) |
| 38 | +* Layout elements |
| 39 | + * [**Pages**](#pages) |
| 40 | + * [**Paragraphs**](#paragraphs) |
| 41 | + * [**Lines**](#lines) |
| 42 | + * [**Tables**](#tables) |
| 43 | + * [**Sections**](#sections) |
| 44 | + |
| 45 | +> [!NOTE] |
| 46 | +> Not all content and layout elements are applicable or currently supported by all document file types. |
| 47 | +
|
| 48 | +### Markdown content elements |
| 49 | + |
| 50 | +Content Understanding generates richly formatted markdown that preserves the original document's structure, enabling large language models to better comprehend document context and hierarchical relationships for AI-powered analysis and generation tasks. In addition to words, selection marks, barcodes, formulas, and images as content, the markdown also includes sections, tables, and page metadata for both visual rendering and machine processing. Learn more about how Content Understanding represents [content and layout elements in markdown](markdown.md). |
| 51 | + |
| 52 | +#### Words |
| 53 | + |
| 54 | +A `word` is a content element composed of a sequence of characters. Content Understanding uses word boundaries defined by [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/#Word_Boundaries). For Latin languages, words may be split from punctuation even without intervening spaces. In some languages, such as Chinese, supplemental word dictionaries are used to enable word breaking at semantic boundaries. For more information, *see* [Boundary Analysis](https://unicode-org.github.io/icu/userguide/boundaryanalysis/). |
| 55 | + |
| 56 | + |
| 57 | +:::image type="content" source="../media/document/word-boundaries.png" alt-text="Screenshot of detected words."::: |
| 58 | + |
| 59 | +#### Selection marks |
| 60 | + |
| 61 | +A `selection mark` is a content element that represents a visual glyph indicating the state of a selection. Selection marks may appear as check boxes, check marks, radio buttons, and similar glyphs. The state of a selection mark can be selected or unselected, with different visual representations to indicate each state. They're encoded as words in the document analysis result using `☒` (selected) and `☐` (unselected). |
| 62 | + |
| 63 | +Content Understanding detects check marks inside table cells as selection marks in the selected state. However, it doesn't detect empty table cells as selection marks in the unselected state. |
| 64 | + |
| 65 | +:::image type="content" source="../media/document/selection-marks.png" alt-text="Screenshot of detected selection marks."::: |
| 66 | + |
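Because selection marks are encoded as the characters `☒` and `☐`, tallying them in the returned text takes only a few lines. The following is a minimal sketch, not part of any SDK: the helper name `count_selection_marks` and the sample string are made up for illustration.

```python
from collections import Counter

SELECTED = "\u2612"    # ☒ (selected)
UNSELECTED = "\u2610"  # ☐ (unselected)


def count_selection_marks(markdown):
    """Count selected and unselected marks in a markdown string.

    Hypothetical helper: it only scans for the two glyphs described
    in the text and knows nothing about the rest of the result.
    """
    counts = Counter(ch for ch in markdown if ch in (SELECTED, UNSELECTED))
    return {"selected": counts[SELECTED], "unselected": counts[UNSELECTED]}


# Illustrative input, not real service output.
sample = "☒ Email updates\n☐ SMS updates\n☒ Postal mail"
print(count_selection_marks(sample))
```

Keep in mind that empty table cells aren't reported as unselected marks, so a tally like this can undercount unselected options that appear in tables.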
| 67 | +#### Barcodes |
| 68 | + |
| 69 | +A `barcode` is a content element that describes both linear (for example, UPC, EAN) and 2D (for example, QR, MaxiCode) barcodes. Content Understanding represents each barcode using its detected type and extracted value. The following barcode formats are currently accepted: |
| 70 | + |
| 71 | + |
| 72 | +* `QR Code` |
| 73 | +* `Code 39` |
| 74 | +* `Code 93` |
| 75 | +* `Code 128` |
| 76 | +* `UPC (UPC-A & UPC-E)` |
| 77 | +* `PDF417` |
| 78 | +* `EAN-8` |
| 79 | +* `EAN-13` |
| 80 | +* `Codabar` |
| 81 | +* `Databar` |
| 82 | +* `Databar (expanded)` |
| 83 | +* `ITF` |
| 84 | +* `Data Matrix` |
| 85 | + |
| 86 | +#### Formulas |
| 87 | + |
| 88 | +A `formula` is a content element representing mathematical expressions in the document. It may be an `inline` formula embedded with other text, or a `display` formula that takes up an entire line. Multiline formulas are represented as multiple `display` formula elements grouped into `paragraphs` to preserve mathematical relationships. |
| 89 | + |
| 90 | +#### Images |
| 91 | + |
| 92 | +An `image` is a content element that represents an embedded image, figure, or chart in the document. Content Understanding extracts any embedded text from the images, and any associated captions and footnotes. |
| 93 | + |
| 94 | +### Layout elements |
| 95 | + |
| 96 | +Document layout elements are visual and structural components, such as pages, paragraphs, lines, tables, and sections, that help interpret content. Extracting these elements enables tools to analyze documents efficiently for tasks like information retrieval, semantic understanding, and data structuring. |
| 97 | + |
| 98 | +#### Pages |
| 99 | + |
| 100 | +A `page` is a grouping of content that typically corresponds to one side of a sheet of paper. A rendered page is characterized via `width` and `height` in the specified `unit`. In general, images use pixels while PDFs use inches. The `angle` property describes the overall text angle in degrees for pages that may be rotated. |
| 101 | + |
| 102 | +> [!NOTE] |
| 103 | +> For spreadsheets like Excel, each sheet is mapped to a page. For presentations, like PowerPoint, each slide is mapped to a page. For file formats like HTML or Word documents, which lack a native page concept without rendering, the entire main content is treated as a single page. |
| 104 | +
|
| 105 | +#### Paragraphs |
| 106 | + |
| 107 | +A `paragraph` is an ordered sequence of lines that form a logical unit. Typically, the lines share common alignment and spacing between lines. Paragraphs are often delimited via indentation, added spacing, or bullets/numbering. Some paragraphs may have a special functional `role` in the document. Currently supported roles include page header, page footer, page number, title, section heading, footnote, and formula block. |
| 108 | + |
| 109 | +#### Lines |
| 110 | + |
| 111 | +A `line` is an ordered sequence of consecutive content elements, often separated by visual spaces. Content elements in the same horizontal plane (row) but separated by more than a single visual space are most often split into multiple lines. While this feature sometimes splits semantically contiguous content into separate lines, it enables the representation of textual content split into multiple columns or cells. Lines in vertical writing are detected in the vertical direction. |
| 112 | + |
| 113 | +#### Tables |
| 114 | + |
| 115 | +A `table` organizes content into a group of cells in a grid layout. The rows and columns may be visually separated by grid lines, color banding, or greater spacing. The position of a table cell is specified via its row and column indices. A cell can span across multiple rows and columns. |
| 116 | + |
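To make the row/column indexing and spanning concrete, the following sketch expands a list of cells into a dense grid. It's illustrative only: the property names `rowIndex`, `columnIndex`, `rowSpan`, `columnSpan`, and `content`, and the zero-based indexing, are assumptions made for this example rather than a confirmed response schema.

```python
def build_grid(row_count, column_count, cells):
    """Expand table cells into a row-major grid of cell contents.

    A cell that spans several rows or columns has its content
    repeated in every grid position it covers.
    """
    grid = [[None] * column_count for _ in range(row_count)]
    for cell in cells:
        row_span = cell.get("rowSpan", 1)        # assumed default of 1
        column_span = cell.get("columnSpan", 1)  # assumed default of 1
        for r in range(cell["rowIndex"], cell["rowIndex"] + row_span):
            for c in range(cell["columnIndex"], cell["columnIndex"] + column_span):
                grid[r][c] = cell["content"]
    return grid


# Hypothetical cells: one description cell spanning two columns,
# followed by a regular row.
cells = [
    {"rowIndex": 0, "columnIndex": 0, "columnSpan": 2, "content": "Quarterly totals"},
    {"rowIndex": 1, "columnIndex": 0, "content": "Q1"},
    {"rowIndex": 1, "columnIndex": 1, "content": "120"},
]
grid = build_grid(2, 2, cells)
for row in grid:
    print(row)
```

Repeating the content of a spanning cell is one design choice among several. Depending on the downstream use, leaving covered positions empty may be preferable.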
| 117 | +Based on its position and styling, a cell can be classified as general content, row header, column header, stub head, or description: |
| 118 | + |
| 119 | +* A row header cell is typically the first cell in a row that describes the other cells in the row. |
| 120 | + |
| 121 | +* A column header cell is typically the first cell in a column that describes the other cells in the column. |
| 122 | + |
| 123 | +* A row or column can contain multiple header cells to describe hierarchical content. |
| 124 | + |
| 125 | +* A stub head cell is typically the cell in the first row and first column position. It can be empty or describe the values in the header cells in the same row/column. |
| 126 | + |
| 127 | +* A description cell generally appears at the topmost or bottommost area of a table, describing the overall table content. However, it can sometimes appear in the middle of a table to break the table into sections. Typically, description cells span across multiple cells in a single row. |
| 128 | + |
| 129 | +A table caption specifies content that explains the table. A table can further have a set of footnotes. Unlike a description cell, a caption typically lies outside the grid layout. Table footnotes annotate content inside the table, often marked with footnote symbols. They're often found below the table grid. |
| 130 | + |
| 131 | +A table may span across consecutive pages of a document. In this situation, table continuations in subsequent pages generally maintain the same column count, width, and styling. They often repeat the column headers. Other than page headers, footers, and page numbers, there's generally no intervening content between the initial table and its continuations. |
| 132 | + |
| 133 | +> [!NOTE] |
| 134 | +> The span for tables covers only the core content and excludes the associated caption and footnotes. |
| 135 | +
|
| 136 | +:::image type="content" source="../media/document/table.png" alt-text="Illustration of table using the layout feature."::: |
| 137 | + |
| 138 | +#### Sections |
| 139 | + |
| 140 | +A `section` is a logical grouping of related content elements that form a hierarchical structure within the document. It often starts with a section heading as the first paragraph. A section may contain subsections, creating a nested document structure that preserves semantic relationships. |
| 141 | + |
| 142 | +### Element properties |
| 143 | + |
| 144 | +Documents consist of various components that can be categorized into structural, textual, and form-related elements. These elements not only define the organization and presentation of the document but can also be systematically identified and extracted for further analysis or application. |
| 145 | + |
| 146 | +#### Spans |
| 147 | + |
| 148 | +The `span` property specifies the logical position of the element in the document via the character offset and length into the top-level `markdown` string property. By default, character offsets and lengths are returned in Unicode code points, the unit used by Python 3. To accommodate development environments that use different character units, you can specify the `stringEncoding` query parameter to return span offsets and lengths in UTF-16 code units (Java, JavaScript, .NET) or UTF-8 bytes (Go, Rust, Ruby, PHP). |
| 149 | + |
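The difference between the three character units shows up as soon as the markdown contains characters outside the Basic Multilingual Plane. The following sketch converts a code-point span into the other two units. The helper name `convert_span` and the unit labels `"codePoint"`, `"utf16"`, and `"utf8"` are labels chosen for this example, not confirmed values of the `stringEncoding` parameter.

```python
def convert_span(markdown, offset, length, unit):
    """Convert a code-point span into the requested character unit.

    `offset` and `length` are in Unicode code points, which is how
    Python 3 indexes strings.
    """
    def measure(text):
        if unit == "codePoint":
            return len(text)
        if unit == "utf16":
            # Each UTF-16 code unit is two bytes.
            return len(text.encode("utf-16-le")) // 2
        if unit == "utf8":
            return len(text.encode("utf-8"))
        raise ValueError(f"unknown unit: {unit}")

    return measure(markdown[:offset]), measure(markdown[offset:offset + length])


# Illustrative text: the emoji is 1 code point, 2 UTF-16 code units,
# and 4 UTF-8 bytes.
markdown = "Total 😀 due: 42"
offset = markdown.index("due")
length = len("due")

for unit in ("codePoint", "utf16", "utf8"):
    print(unit, convert_span(markdown, offset, length, unit))
```

In Python, a span returned in code points can be applied directly as `markdown[offset:offset + length]`. In environments that index strings by UTF-16 code units or bytes, requesting the matching unit avoids this conversion altogether.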
| 150 | +#### Source |
| 151 | + |
| 152 | +The `source` property describes the visual position of the element in the file using an encoded string. For documents, the source string may be in one of the following formats: |
| 153 | +* Bounding polygon: `D({pageNumber},{x1},{y1},{x2},{y2},{x3},{y3},{x4},{y4})` |
| 154 | +* Axis-aligned bounding box: `D({pageNumber},{left},{top},{width},{height})` |
| 155 | + |
| 156 | +Page numbers are `1-indexed`. The bounding polygon describes a sequence of points, clockwise from the left relative to the natural orientation of the element. For quadrilaterals, the points represent the top-left, top-right, bottom-right, and bottom-left corners. Each point represents the **x**, **y** coordinate in the length unit specified by the `unit` property. In general, the unit of measure for images is pixels while PDFs use inches. |
| 157 | + |
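The two source formats described above can be told apart by the number of values after the page number. The following parser is a minimal sketch under that assumption. The function name `parse_source` and the returned dictionary shape are made up for illustration, and the sketch handles a single `D(...)` expression only.

```python
import re

# Matches one D(...) expression and captures the comma-separated values.
SOURCE_PATTERN = re.compile(r"^D\(([^)]*)\)$")


def parse_source(source):
    """Parse a document source string into a page number and geometry."""
    match = SOURCE_PATTERN.match(source.strip())
    if match is None:
        raise ValueError(f"not a document source string: {source!r}")

    parts = match.group(1).split(",")
    page = int(parts[0])  # page numbers are 1-indexed
    numbers = [float(value) for value in parts[1:]]

    if len(numbers) == 4:
        # Axis-aligned bounding box: left, top, width, height.
        return {"page": page, "kind": "box", "box": tuple(numbers)}
    if len(numbers) >= 6 and len(numbers) % 2 == 0:
        # Bounding polygon: pair up the x and y coordinates in order.
        points = list(zip(numbers[0::2], numbers[1::2]))
        return {"page": page, "kind": "polygon", "points": points}
    raise ValueError(f"unexpected number of coordinates: {len(numbers)}")


# Illustrative values, not real service output.
print(parse_source("D(1,0.5,1.0,3.5,1.0,3.5,1.5,0.5,1.5)"))
print(parse_source("D(2,10,20,100,50)"))
```

The coordinates stay in whatever length unit the page's `unit` property specifies, so any conversion to pixels or inches has to happen separately.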
| 158 | +:::image type="content" source="../media/document/bounding-regions.png" alt-text="Screenshot of detected bounding regions."::: |
| 159 | + |
| 160 | +> [!NOTE] |
| 161 | +> Currently, Content Understanding only returns `4-point` quadrilaterals as bounding polygons. Future versions may return a different number of points to describe more complex shapes, such as curved lines or nonrectangular images. Currently, `source` is only returned for elements from rendered files (PDF/image). |
| 162 | +
|
| 163 | +## Supported content and layout elements |
| 164 | + |
| 165 | +Different file formats support different subsets of content and layout elements. The following table lists the currently supported file formats for each document type. |
| 166 | + |
| 167 | +|Document type|Supported format| |
| 168 | +|-----|-----| |
| 169 | +|**Portable Document Format**|`.pdf`| |
| 170 | +|**Image**|`.jpeg/.jpg`, `.png`, `.bmp`, `.tiff`, `.heif`| |
| 171 | +|**Microsoft Office**|`.docx`, `.pptx`, `.xls`| |
| 172 | + |
| 173 | +## Next steps |
| 174 | + |
| 175 | +* Try processing your document content using Content Understanding in [Azure AI Foundry](https://aka.ms/cu-landing). |
| 176 | +* Learn to analyze document content using [**analyzer templates**](../quickstart/use-ai-foundry.md). |
| 177 | +* Review code sample: [**visual document search**](https://github.com/Azure-Samples/azure-ai-search-with-content-understanding-python/blob/main/notebooks/search_with_visual_document.ipynb). |
| 178 | +* Review code sample: [**analyzer templates**](https://github.com/Azure-Samples/azure-ai-content-understanding-python/tree/main/analyzer_templates). |