@@ -99,17 +99,24 @@ Try extracting text from forms and documents using the Document Intelligence Stu
99
99
:::image type="content" source="media/studio/run-analysis-analyze-options.png" alt-text="Screenshot of Run analysis and Analyze options buttons in the Document Intelligence Studio.":::
*See* our [Language Support—document analysis models](language-support-ocr.md) page for a complete list of supported languages.
106
+
See our [Language Support—document analysis models](language-support-ocr.md) page for a complete list of supported languages.
107
107
108
108
## Data extraction
109
109
110
+
> [!NOTE]
111
+
> Microsoft Word and HTML files are supported in v3.1 and later versions. Compared with PDF and images, the following features are not supported:
112
+
> - Page objects don't include an angle, width/height, or unit.
113
+
> - For each object detected, there is no bounding polygon or bounding region.
114
+
> - Page range (`pages`) is not supported as a parameter.
115
+
> - No `lines` object.
116
+
110
117
### Pages
111
118
112
-
The pages collection is the first object you see in the service response. The page units in the model output are computed as shown:
119
+
The pages collection is a list of pages within the document. Each page is represented by its sequential number within the document, its orientation angle (which indicates whether the page is rotated), and its width and height (dimensions in pixels). The page units in the model output are computed as shown:
@@ -131,8 +138,7 @@ The pages collection is the first object you see in the service response. The pa
131
138
"unit": "pixel",
132
139
"words": [],
133
140
"lines": [],
134
-
"spans": [],
135
-
"kind": "document"
141
+
"spans": []
136
142
}
137
143
]
138
144
```
@@ -141,12 +147,9 @@ The pages collection is the first object you see in the service response. The pa
141
147
142
148
For large multi-page PDF documents, use the `pages` query parameter to indicate specific page numbers or page ranges for text extraction.
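As a rough illustration of the `pages` parameter, the sketch below collapses a list of 1-based page numbers into a range string such as `1-3,5` and appends it to a query string. The helper names `format_pages` and `build_query` are hypothetical, and the API version value is only borrowed from the link at the end of this article; check the REST reference for the exact request URL your resource expects.

```python
from urllib.parse import urlencode


def format_pages(pages):
    """Collapse 1-based page numbers into a string like '1-3,5' (hypothetical helper)."""
    parts = []
    start = prev = None
    for n in sorted(set(pages)):
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n  # extend the current run of consecutive pages
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = n
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def build_query(pages, api_version):
    """Build the query string only; the endpoint and path are not shown here."""
    params = {"api-version": api_version, "pages": format_pages(pages)}
    return urlencode(params, safe=",")  # keep commas readable in the range list


print(build_query([5, 1, 2, 3], "2023-10-31-preview"))
```

Keeping the range formatting separate from the HTTP call makes it easy to reuse with whichever SDK or REST client you already have.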
143
149
144
-
> [!NOTE]
145
-
> For the Microsoft Word and HTML file support, the API ignores the pages parameter and extracts all pages by default.
146
-
147
150
### Paragraphs
148
151
149
-
The Read OCR model in Document Intelligence extracts all identified blocks of text in the `paragraphs` collection as a top level object under `analyzeResults`. Each entry in this collection represents a text block and includes the extracted text as`content`and the bounding `polygon` coordinates. The `span` information points to the text fragment within the top-level `content` property that contains the full text from the document.
152
+
The Read OCR model in Document Intelligence extracts all identified blocks of text in the `paragraphs` collection as a top-level object under `analyzeResult`. Each entry in this collection represents a text block and includes the extracted text as `content` and the bounding `polygon` coordinates. The `span` information points to the text fragment within the top-level `content` property that contains the full text from the document.
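To make the relationship between a span and the top-level `content` concrete, here is a minimal sketch. It assumes each span carries an `offset` and a `length`, and that offsets line up with Python string indices (which depends on the string index type used in the request). The sample data and the `text_for_spans` helper are invented for illustration and are not service output.

```python
def text_for_spans(content, spans):
    """Return the text that a list of {offset, length} spans points to."""
    return "".join(
        content[s["offset"]: s["offset"] + s["length"]] for s in spans
    )


# Hand-written sample shaped like the fields described above (not real output).
analyze_result = {
    "content": "Invoice 42\nTotal due: $10",
    "paragraphs": [
        {"content": "Invoice 42", "spans": [{"offset": 0, "length": 10}]},
        {"content": "Total due: $10", "spans": [{"offset": 11, "length": 14}]},
    ],
}

for paragraph in analyze_result["paragraphs"]:
    fragment = text_for_spans(analyze_result["content"], paragraph["spans"])
    print(repr(fragment))
```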
150
153
151
154
```json
152
155
"paragraphs": [
@@ -162,7 +165,7 @@ The Read OCR model in Document Intelligence extracts all identified blocks of te
162
165
163
166
The Read OCR model extracts print and handwritten style text as `lines` and `words`. The model outputs bounding `polygon` coordinates and `confidence` for the extracted words. The `styles` collection includes any handwritten style for lines if detected along with the spans pointing to the associated text. This feature applies to [supported handwritten languages](language-support.md).
164
167
165
-
For Microsoft Word, Excel, PowerPoint, and HTML, Document Intelligence version 2023-10-31-preview the Read model extracts all embedded text as is. For embedded images, it uses OCR technology to extract text from each image and the extraction as an added entry to the `pages` collection. Added entries include the extracted text, lines, and words, their bounding polygons, confidences, and the spans pointing to the associated text.
168
+
For Microsoft Word, Excel, PowerPoint, and HTML files, the Document Intelligence Read model v3.1 and later versions extract all embedded text as is. Text is extracted as words and paragraphs. Embedded images are not supported.
166
169
167
170
168
171
```json
@@ -201,6 +204,8 @@ The response includes classifying whether each text line is of handwriting style
201
204
}
202
205
```
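One way to use the `styles` collection is to pull out only the text flagged as handwritten. The sketch below assumes each style entry has an `isHandwritten` flag, a `confidence` score, and `spans` with `offset` and `length` that index into the top-level `content`. The sample data, the `handwritten_fragments` helper, and the `0.5` threshold are illustrative choices, not service defaults.

```python
def handwritten_fragments(result, min_confidence=0.5):
    """Collect text fragments whose style entry is marked as handwritten."""
    content = result["content"]
    fragments = []
    for style in result.get("styles", []):
        if not style.get("isHandwritten"):
            continue
        if style.get("confidence", 0.0) < min_confidence:
            continue  # skip low-confidence classifications
        for span in style["spans"]:
            start = span["offset"]
            fragments.append(content[start:start + span["length"]])
    return fragments


# Hand-written sample shaped like the fields described above (not real output).
sample = {
    "content": "Printed header\nsigned: J. Doe",
    "styles": [
        {
            "isHandwritten": True,
            "confidence": 0.95,
            "spans": [{"offset": 15, "length": 15}],
        }
    ],
}

print(handwritten_fragments(sample))
```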
203
206
207
+
If you enabled the [font/style add-on capability](concept-add-on-capabilities.md#font-property-extraction), the font/style result is also returned as part of the `styles` object.
208
+
204
209
## Next steps
205
210
206
211
Complete a Document Intelligence quickstart:
@@ -216,4 +221,4 @@ Complete a Document Intelligence quickstart:
216
221
Explore our REST API:
217
222
218
223
> [!div class="nextstepaction"]
219
-
> [Document Intelligence API v4.0](/rest/api/aiservices/document-models/analyze-document?view=rest-aiservices-2023-10-31-preview&preserve-view=true&tabs=HTTP)
224
+
> [Document Intelligence API v4.0](/rest/api/aiservices/document-models/analyze-document?view=rest-aiservices-2023-10-31-preview&preserve-view=true&tabs=HTTP)