File: ui/document-elements.mdx
Lines changed: 138 additions & 5 deletions
@@ -65,6 +65,11 @@ If you apply chunking, you will also see the `CompositeElement` type.
65
65
A composite element might be formed by combining one or more sequential elements produced by partitioning. For example,
66
66
several individual list items might be combined into a single chunk.
67
67
68
+
For the most up-to-date list of available element types, see the `TYPE_TO_TEXT_ELEMENT_MAP` type-annotated mapping definition and the
69
+
`ElementType` class definition in the [elements.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py)
70
+
file, located in the [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured)
71
+
repository on GitHub.
72
+
68
73
## Element ID
69
74
70
75
By default, the element ID is a SHA-256 hash of the element's text, its position on the page, the page number it's on,
@@ -79,6 +84,11 @@ Element metadata enables you to do things such as:
79
84
* Filter file elements based on an element's metadata value. For instance, you might want to limit your scope to elements from a certain page, or you might want to use only elements that have an email matching a regular expression in their metadata.
80
85
* Map an element to the page where it occurred so that the original page can be retrieved when that element matches search criteria.
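
As a minimal sketch of the filtering described above, the following Python snippet works on element dictionaries shaped like Unstructured JSON output. The sample data, the helper names `on_page` and `sender_matches`, and the use of a `sent_from` metadata field for email addresses are assumptions made for illustration only.

```python
import re

# Hypothetical sample elements, shaped like Unstructured JSON output.
elements = [
    {"type": "Title", "text": "Quarterly report",
     "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Contact us.",
     "metadata": {"page_number": 2, "sent_from": ["alice@example.com"]}},
    {"type": "NarrativeText", "text": "Appendix.",
     "metadata": {"page_number": 3, "sent_from": ["bob@other.org"]}},
]


def on_page(elements, page_number):
    """Keep only the elements whose metadata places them on the given page."""
    return [
        e for e in elements
        if e.get("metadata", {}).get("page_number") == page_number
    ]


def sender_matches(elements, pattern):
    """Keep only the elements with a sender address matching the regex."""
    regex = re.compile(pattern)
    return [
        e for e in elements
        if any(regex.search(address)
               for address in e.get("metadata", {}).get("sent_from", []))
    ]


page_two = on_page(elements, 2)
from_example = sender_matches(elements, r"@example\.com$")
```

Both helpers treat a missing metadata field as "no match" instead of raising an error, because not every element carries every field.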
81
86
87
+
For the most up-to-date list of all available metadata fields, see the
88
+
`ElementMetadata` class definition in the [elements.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py)
89
+
file, located in the [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured)
90
+
repository on GitHub.
91
+
82
92
### Common metadata fields
83
93
84
94
All file types return the following `metadata` fields when the information is available from the source file:
@@ -116,7 +126,6 @@ The `coordinates` metadata field contains:
116
126
- `points`: These specify the corners of the bounding box, starting from the top left corner and proceeding counter-clockwise. The points represent pixels; the origin is in the top left, and the `y` coordinate increases in the downward direction.
117
127
- `system`: The points have an associated coordinate system. A typical example of a coordinate system is `PixelSpace`, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.
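
To illustrate how the `points` corners can be used, here is a small Python sketch that derives a bounding box from four corner points. The function name `bounding_box` and the sample coordinates are hypothetical. The corner order follows the description above: top left first, then counter-clockwise, with `y` increasing downward.

```python
def bounding_box(points):
    """Return left, top, width, and height for a list of (x, y) corners.

    Uses min/max, so the result does not depend on the corner order.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    left, top = min(xs), min(ys)
    return {
        "left": left,
        "top": top,
        "width": max(xs) - left,
        "height": max(ys) - top,
    }


# Hypothetical corners: top left, bottom left, bottom right, top right.
points = [(10.0, 20.0), (10.0, 70.0), (110.0, 70.0), (110.0, 20.0)]
box = bounding_box(points)
```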
118
128
119
-
120
129
### Additional metadata fields by file type
121
130
122
131
| Field name | Applicable file types | Description |
@@ -155,16 +164,52 @@ For `Table` elements, the raw text of the table will be stored in the `text` att
155
164
of the table will be available in the element metadata under `text_as_html`.
156
165
Unstructured automatically extracts all tables for all document types if you check the **Infer Table Structure** box in the **ConnectorSettings** area of the **Transform** section of a workflow.
157
166
158
-
Here's an example of a table element. The `text` of the element will look like this:
167
+
Here's an example of a table element. The `text` of the element will look like this (line breaks are added here for readability):
159
168
160
169
```text
161
-
Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents
170
+
Dataset Base Model1 Large Model Notes
171
+
PubLayNet [38] F / M M Layouts of modern scientific documents
172
+
PRImA [3] M - Layouts of scanned modern magazines and scientific reports
173
+
Newspaper [17] F - Layouts of scanned US newspapers from the 20th century
174
+
TableBank [18] F F Table region on modern scientific and business document
175
+
HJDataset [31] F / M - Layouts of history Japanese documents
162
176
```
163
177
164
-
And the `text_as_html` metadata for the same element will look like this:
178
+
And the `text_as_html` metadata for the same element will look like this (line breaks are added here for readability):
165
179
166
180
```html
167
-
<table><thead><th>Dataset</th><th>| Base Model’</th><th>| Notes</th></thead><tr><td>PubLayNet</td><td>[38] F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset [31]</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>
181
+
<table>
182
+
<thead>
183
+
<th>Dataset</th>
184
+
<th>| Base Model’</th>
185
+
<th>| Notes</th>
186
+
</thead>
187
+
<tr>
188
+
<td>PubLayNet</td>
189
+
<td>[38] F/M</td>
190
+
<td>Layouts of modern scientific documents</td>
191
+
</tr>
192
+
<tr>
193
+
<td>PRImA [3]</td>
194
+
<td>M</td>
195
+
<td>Layouts of scanned modern magazines and scientific reports</td>
196
+
</tr>
197
+
<tr>
198
+
<td>Newspaper</td>
199
+
<td>F</td>
200
+
<td>Layouts of scanned US newspapers from the 20th century</td>
201
+
</tr>
202
+
<tr>
203
+
<td>TableBank</td>
204
+
<td>F</td>
205
+
<td>Table region on modern scientific and business document</td>
206
+
</tr>
207
+
<tr>
208
+
<td>HJDataset [31]</td>
209
+
<td>F/M</td>
210
+
<td>Layouts of history Japanese documents</td>
211
+
</tr>
212
+
</table>
168
213
```
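
As a rough sketch of how the `text_as_html` value might be consumed, the following Python snippet turns such a table into a list of rows using only the standard library. The class name `TableRows` and the shortened sample string are made up for illustration. The parser treats `<thead>` as a row because the example above places `<th>` cells directly inside `<thead>`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <thead> or <tr> as one list per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tr"):
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("thead", "tr") and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened, hypothetical sample in the same shape as the example above.
sample = (
    "<table><thead><th>Dataset</th><th>Notes</th></thead>"
    "<tr><td>PRImA [3]</td><td>M</td></tr></table>"
)
parser = TableRows()
parser.feed(sample)
rows = parser.rows
```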
169
214
170
215
### Data connector metadata fields
@@ -191,3 +236,91 @@ Documents can include additional file metadata, based on the specified source co
191
236
| OneDrive |`server_relative_path`, `user_pname`|
192
237
| S3 |`protocol`, `remote_file_path`|
193
238
| SharePoint |`server_path`, `site_url`|
239
+
240
+
### VLM-generated HTML elements
241
+
242
+
The vision language model (VLM) [partitioner](/ui/partitioning) also generates an HTML representation of the Unstructured elements that are produced.
243
+
Unstructured has developed an element ontology that assigns each incoming Unstructured element to one of a set of defined element ontology types.
244
+
These element ontology types are used to generate standard HTML elements, with the element ontology type applied as a class attribute on
245
+
each of those HTML elements. The generated HTML elements
246
+
are output as `text_as_html` along with their `parent_id` in `metadata`, to allow for easier HTML reconstruction of the entire document as needed.
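
One possible way to use `parent_id` for reconstruction is sketched below in Python: HTML fragments are grouped under the parent they point to and joined in document order. The helper name `html_by_parent`, the sample IDs, and the class names in the sample fragments are hypothetical, and the real output may be structured differently.

```python
from collections import defaultdict


def html_by_parent(elements):
    """Group text_as_html fragments by parent_id, keeping document order."""
    grouped = defaultdict(list)
    for element in elements:
        metadata = element.get("metadata", {})
        fragment = metadata.get("text_as_html")
        if fragment is not None:
            grouped[metadata.get("parent_id")].append(fragment)
    return {parent: "\n".join(parts) for parent, parts in grouped.items()}


# Hypothetical elements; the IDs and class names are illustrative only.
elements = [
    {"element_id": "t1", "metadata": {
        "parent_id": "p1",
        "text_as_html": '<table class="Table"><tr><td>A</td></tr></table>'}},
    {"element_id": "n1", "metadata": {
        "parent_id": "p1",
        "text_as_html": '<p class="Paragraph">Summary</p>'}},
    {"element_id": "n2", "metadata": {"parent_id": "p2"}},
]
pages = html_by_parent(elements)
```

Elements without a `text_as_html` value are skipped, so a parent appears in the result only if at least one of its children has an HTML fragment.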
247
+
248
+
For example, given the following table element produced with the VLM partitioner, the `text_as_html` field is an HTML representation of the
249
+
derived table, and `parent_id` is the `element_id` of the Unstructured element for the page that contains this table. (Line breaks are added here to the
250
+
`text` and `text_as_html` fields for readability.)