Commit d064987

Unstructured element ontology and HTML-element mapping, plus links to latest element and metadata listings in source (#608)
1 parent 3edde38 commit d064987

1 file changed

+138
-5
lines changed

ui/document-elements.mdx

Lines changed: 138 additions & 5 deletions
@@ -65,6 +65,11 @@ If you apply chunking, you will also see the `CompositeElement` type.
6565
A composite element might be formed by combining one or more sequential elements produced by partitioning. For example,
6666
several individual list items might be combined into a single chunk.
6767

68+
For the most up-to-date list of available element types, see the `TYPE_TO_TEXT_ELEMENT_MAP` type-annotated mapping definition and the
69+
`ElementType` class definition in the [elements.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py)
70+
file, located in the [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured)
71+
repository in GitHub.
72+
6873
## Element ID
6974

7075
By default, the element ID is a SHA-256 hash of the element's text, its position on the page, the page number it's on,
@@ -79,6 +84,11 @@ Element metadata enables you to do things such as:
7984
* Filter file elements based on an element's metadata value. For instance, you might want to limit your scope to elements from a certain page, or you might want to use only elements that have an email matching a regular expression in their metadata.
8085
* Map an element to the page where it occurred so that the original page can be retrieved when that element matches search criteria.
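The filtering described in the list above can be sketched in a few lines of Python. This is a minimal sketch, assuming the elements have already been loaded from Unstructured's JSON output as a list of dictionaries. The helper names `filter_by_page` and `filter_by_metadata_regex` are hypothetical, and `sent_from` is an assumed metadata field name used only for illustration.

```python
import re


def filter_by_page(elements, page_number):
    """Keep elements whose metadata reports the given page number."""
    return [
        e for e in elements
        if e.get("metadata", {}).get("page_number") == page_number
    ]


def filter_by_metadata_regex(elements, field, pattern):
    """Keep elements whose metadata[field] has a string matching pattern."""
    rx = re.compile(pattern)
    kept = []
    for e in elements:
        value = e.get("metadata", {}).get(field)
        # Treat a single string and a list of strings the same way.
        values = value if isinstance(value, list) else [value]
        if any(isinstance(v, str) and rx.search(v) for v in values):
            kept.append(e)
    return kept


# Hypothetical sample data; `sent_from` is an assumed field name.
elements = [
    {"type": "Title", "text": "Invoice", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Terms", "metadata": {"page_number": 2}},
    {"type": "NarrativeText", "text": "Hello",
     "metadata": {"page_number": 1, "sent_from": ["alice@example.com"]}},
]

page_one = filter_by_page(elements, 1)
from_example = filter_by_metadata_regex(elements, "sent_from", r"@example\.com$")
```

Because each element carries its own `page_number`, the same lookup also maps a search hit back to the page it came from.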
8186

87+
For the most up-to-date list of all available metadata fields, see the
88+
`ElementMetadata` class definition in the [elements.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py)
89+
file, located in the [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured)
90+
repository in GitHub.
91+
8292
### Common metadata fields
8393

8494
All file types return the following `metadata` fields when the information is available from the source file:
@@ -116,7 +126,6 @@ The `coordinates` metadata field contains:
116126
- `points`: These specify the corners of the bounding box, starting from the top left corner and proceeding counter-clockwise. The points are in pixels, the origin is at the top left, and the `y` coordinate increases in the downward direction.
117127
- `system`: The points have an associated coordinate system. A typical example of a coordinate system is `PixelSpace`, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.
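As a rough illustration of how the `coordinates` field might be used, the following sketch derives a bounding box from `points` and scales it against the layout size. It assumes the field is available as a dictionary and that the layout dimensions are stored under the keys `layout_width` and `layout_height`. Those key names, the sample values, and the helper names `bounding_box` and `relative_box` are assumptions, not confirmed by this page.

```python
def bounding_box(coordinates):
    """Return (left, top, right, bottom) in pixels from the corner points.

    Using min/max keeps the result independent of the corner order.
    """
    xs = [p[0] for p in coordinates["points"]]
    ys = [p[1] for p in coordinates["points"]]
    return min(xs), min(ys), max(xs), max(ys)


def relative_box(coordinates):
    """Scale the bounding box to fractions of the layout width and height."""
    left, top, right, bottom = bounding_box(coordinates)
    width = coordinates["layout_width"]    # assumed key name
    height = coordinates["layout_height"]  # assumed key name
    return left / width, top / height, right / width, bottom / height


# Hypothetical sample: corners start at the top left and proceed
# counter-clockwise; y increases downward.
sample = {
    "points": [[100.0, 200.0], [100.0, 400.0], [500.0, 400.0], [500.0, 200.0]],
    "system": "PixelSpace",
    "layout_width": 1000,
    "layout_height": 2000,
}

left, top, right, bottom = bounding_box(sample)
box_width, box_height = right - left, bottom - top
```

Relative coordinates are convenient when the same page is rendered at a different resolution than the one used during partitioning.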
118128

119-
120129
### Additional metadata fields by file type
121130

122131
| Field name | Applicable file types | Description |
@@ -155,16 +164,52 @@ For `Table` elements, the raw text of the table will be stored in the `text` att
155164
of the table will be available in the element metadata under `text_as_html`.
156165
Unstructured automatically extracts tables from all document types if you check the **Infer Table Structure** option in the **ConnectorSettings** area of the **Transform** section of a workflow.
157166

158-
Here's an example of a table element. The `text` of the element will look like this:
167+
Here's an example of a table element. The `text` of the element will look like this (line breaks are added here for readability):
159168

160169
```text
161-
Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents
170+
Dataset Base Model1 Large Model Notes
171+
PubLayNet [38] F / M M Layouts of modern scientific documents
172+
PRImA [3] M - Layouts of scanned modern magazines and scientific reports
173+
Newspaper [17] F - Layouts of scanned US newspapers from the 20th century
174+
TableBank [18] F F Table region on modern scientific and business document
175+
HJDataset [31] F / M - Layouts of history Japanese documents
162176
```
163177

164-
And the `text_as_html` metadata for the same element will look like this:
178+
And the `text_as_html` metadata for the same element will look like this (line breaks are added here for readability):
165179

166180
```html
167-
<table><thead><th>Dataset</th><th>| Base Model’</th><th>| Notes</th></thead><tr><td>PubLayNet</td><td>[38] F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset [31]</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>
181+
<table>
182+
<thead>
183+
<th>Dataset</th>
184+
<th>| Base Model’</th>
185+
<th>| Notes</th>
186+
</thead>
187+
<tr>
188+
<td>PubLayNet</td>
189+
<td>[38] F/M</td>
190+
<td>Layouts of modern scientific documents</td>
191+
</tr>
192+
<tr>
193+
<td>PRImA [3]</td>
194+
<td>M</td>
195+
<td>Layouts of scanned modern magazines and scientific reports</td>
196+
</tr>
197+
<tr>
198+
<td>Newspaper</td>
199+
<td>F</td>
200+
<td>Layouts of scanned US newspapers from the 20th century</td>
201+
</tr>
202+
<tr>
203+
<td>TableBank</td>
204+
<td>F</td>
205+
<td>Table region on modern scientific and business document</td>
206+
</tr>
207+
<tr>
208+
<td>HJDataset [31]</td>
209+
<td>F/M</td>
210+
<td>Layouts of history Japanese documents</td>
211+
</tr>
212+
</table>
168213
```
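One way to work with `text_as_html` is to turn it back into rows of cell text. The sketch below uses only Python's standard-library `html.parser`. The class name `TableRows` is hypothetical, and the parser is deliberately tolerant of header cells that appear directly inside `<thead>` without a `<tr>`, as in the sample above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._row is None:
                # Header cells placed directly in <thead>, with no <tr>.
                self._row = []
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("tr", "thead") and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened, illustrative version of the sample table.
html = (
    "<table><thead><th>Dataset</th><th>Notes</th></thead>"
    "<tr><td>PubLayNet</td><td>Layouts of modern scientific documents</td></tr>"
    "<tr><td>PRImA [3]</td><td>Layouts of scanned modern magazines</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(html)
parser.close()
rows = parser.rows
```

The first entry in `rows` holds the header cells, and each later entry holds one data row, which is a convenient shape for writing CSV or building a data frame.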
169214

170215
### Data connector metadata fields
@@ -191,3 +236,91 @@ Documents can include additional file metadata, based on the specified source co
191236
| OneDrive | `server_relative_path`, `user_pname` |
192237
| S3 | `protocol`, `remote_file_path` |
193238
| SharePoint | `server_path`, `site_url` |
239+
240+
### VLM generated HTML elements
241+
242+
The vision language model (VLM) [partitioner](/ui/partitioning) also generates an HTML representation of the Unstructured elements that are produced.
243+
Unstructured has developed an element ontology that assigns each incoming Unstructured element to one of its defined element ontology types.
244+
These element ontology types are used to generate standard HTML elements, with the element ontology type set as the `class` attribute on
245+
those HTML elements. The generated HTML elements
246+
are output as `text_as_html` along with their `parent_id` in `metadata`, to allow for easier HTML reconstruction of the entire document as needed.
247+
248+
For example, given the following table element produced with the VLM partitioner, the `text_as_html` field is an HTML representation of the
249+
derived table, and `parent_id` is the `element_id` of the Unstructured element for the page that contains this table. (Line breaks are added here to the
250+
`text` and `text_as_html` fields for readability.)
251+
252+
```json
253+
{
254+
"type": "Table",
255+
"element_id": "c60aea37616e3db75660918c6d657c38",
256+
"text": "ITEM QUANTITY PRICE TOTAL
257+
Office Desk (Oak wood, 140x70 cm) 2 $249 $498
258+
Ergonomic Chair (Adjustable height & lumbar support) 3 $189 $567
259+
Whiteboard Set (Magnetic, 90x60 cm + 4 markers) 2 $59 $118
260+
SUBTOTAL $1,183
261+
VAT (19%) $224.77
262+
TOTAL $1,407.77",
263+
"metadata": {
264+
"category_depth": 1,
265+
"page_number": 1,
266+
"parent_id": "8cc3b39afcd948d49d85084eaae80ff8",
267+
"text_as_html":
268+
"<table class=\"Table\" id=\"958308a90ccd4fcb825cb12eed20d103\">
269+
<thead>
270+
<tr>
271+
<th>ITEM</th>
272+
<th>QUANTITY</th>
273+
<th>PRICE</th>
274+
<th>TOTAL</th>
275+
</tr>
276+
</thead>
277+
<tbody>
278+
<tr>
279+
<td>Office Desk (Oak wood, 140x70 cm)</td>
280+
<td>2</td>
281+
<td>$249</td>
282+
<td>$498</td>
283+
</tr>
284+
<tr>
285+
<td>Ergonomic Chair (Adjustable height &amp; lumbar support)</td>
286+
<td>3</td>
287+
<td>$189</td>
288+
<td>$567</td>
289+
</tr>
290+
<tr>
291+
<td>Whiteboard Set (Magnetic, 90x60 cm + 4 markers)</td>
292+
<td>2</td>
293+
<td>$59</td>
294+
<td>$118</td>
295+
</tr>
296+
<tr>
297+
<td colspan=\"3\">SUBTOTAL</td>
298+
<td>$1,183</td>
299+
</tr>
300+
<tr>
301+
<td colspan=\"3\">VAT (19%)</td>
302+
<td>$224.77</td>
303+
</tr>
304+
<tr>
305+
<td colspan=\"3\">TOTAL</td>
306+
<td>$1,407.77</td>
307+
</tr>
308+
</tbody>
309+
</table>",
310+
"languages": [
311+
"eng"
312+
],
313+
"filetype": "application/pdf",
314+
"partitioner_type": "vlm_partition",
315+
"filename": "invoice.pdf"
316+
}
317+
}
318+
```
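To illustrate how these fields could support HTML reconstruction, the following sketch reads the `class` attribute from the root tag of `text_as_html` and groups elements by `parent_id`. It is a minimal sketch, assuming the elements are loaded as Python dictionaries shaped like the example above. The names `RootTag`, `ontology_type`, and `group_by_parent` are hypothetical, and the sample HTML is shortened.

```python
from collections import defaultdict
from html.parser import HTMLParser


class RootTag(HTMLParser):
    """Record the tag name and attributes of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.root = None

    def handle_starttag(self, tag, attrs):
        if self.root is None:
            self.root = (tag, dict(attrs))


def ontology_type(element):
    """Return the class attribute of the element's root HTML tag, if any."""
    html = element.get("metadata", {}).get("text_as_html")
    if not html:
        return None
    parser = RootTag()
    parser.feed(html)
    if parser.root is None:
        return None
    return parser.root[1].get("class")


def group_by_parent(elements):
    """Map each parent_id to the list of elements that reference it."""
    children = defaultdict(list)
    for e in elements:
        children[e.get("metadata", {}).get("parent_id")].append(e)
    return children


# Shortened, illustrative version of the element shown above.
table = {
    "type": "Table",
    "element_id": "c60aea37616e3db75660918c6d657c38",
    "metadata": {
        "parent_id": "8cc3b39afcd948d49d85084eaae80ff8",
        "text_as_html": (
            '<table class="Table" id="958308a90ccd4fcb825cb12eed20d103">'
            "<tr><td>TOTAL</td><td>$1,407.77</td></tr></table>"
        ),
    },
}

kind = ontology_type(table)
by_parent = group_by_parent([table])
```

Walking `by_parent` from the page-level elements downward would let each child's `text_as_html` be nested inside its parent's markup.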
319+
320+
For the most up-to-date list of available element ontology types, see the
321+
[ontology.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/ontology.py)
322+
file, located in the [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) repository in GitHub.
323+
324+
For the most up-to-date list of mappings between element ontology types and Unstructured element types, see the
325+
[mappings.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/mappings.py)
326+
file, located in the [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) repository in GitHub.
