Commit 267bbe7

Behavior for orig_elements and image_base64 (#749)
1 parent f81893a commit 267bbe7

4 files changed: +20 -4 lines changed

open-source/core-functionality/chunking.mdx

Lines changed: 4 additions & 0 deletions
@@ -181,6 +181,10 @@ for element in elements:
 # ListItem: Violets are blue
 ```
 
+If High Res partitioning is used, the `orig_elements` value of each chunked element will contain an `image_base64` value for each of the `Image` and `Table` elements. To access the original content for an `image_base64` field,
+[decompress and then Base64-decode](/open-source/how-to/get-chunked-elements) the `orig_elements` value. Then
+[Base64-decode](/open-source/how-to/extract-image-block-types) the resulting `image_base64` values.
+
 ## Learn more
 
 <Icon icon="blog" />&nbsp;&nbsp;[Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices)
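
Reviewer note: the added paragraph describes recovering the original elements from `orig_elements`. A minimal Python sketch of that step follows, using the order given in the `ui/chunking.mdx` change in this same commit (Base64-decode, decompress, then decode as UTF-8). It assumes the decompressed text is a JSON list of element objects, and it lets `zlib` auto-detect a zlib or gzip header because the diff does not say which container is used. The helper name `decode_orig_elements` and the sample data are hypothetical, not part of the library.

```python
import base64
import json
import zlib


def decode_orig_elements(encoded):
    """Turn a compressed, Base64-encoded orig_elements string into a list of dicts."""
    # Step 1: Base64-decode the string into compressed bytes.
    compressed = base64.b64decode(encoded)
    # Step 2: decompress. MAX_WBITS | 32 asks zlib to auto-detect a zlib or
    # gzip header (an assumption: the diff only says "compressed").
    raw = zlib.decompress(compressed, wbits=zlib.MAX_WBITS | 32)
    # Step 3: decode as UTF-8 and parse the JSON list (assumed payload shape).
    return json.loads(raw.decode("utf-8"))


# Hypothetical sample standing in for a real chunk's metadata["orig_elements"].
sample_elements = [
    {"type": "Title", "text": "Roses are red", "metadata": {}},
    {"type": "ListItem", "text": "Violets are blue", "metadata": {}},
]
sample_encoded = base64.b64encode(
    zlib.compress(json.dumps(sample_elements).encode("utf-8"))
).decode("ascii")

decoded = decode_orig_elements(sample_encoded)
for element in decoded:
    print(f"{element['type']}: {element['text']}")
```

The sample is built by running the three steps in reverse, so the round trip only demonstrates the decoding order, not the exact bytes a real pipeline emits.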

snippets/concepts/document-elements.mdx

Lines changed: 3 additions & 0 deletions
@@ -93,10 +93,13 @@ All document types return the following metadata fields when the information is
 | `coordinates` | XY Bounding Box Coordinates. See notes below for further details about the bounding box. |
 | `parent_id` | Element Hierarchy. `parent_id` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a NarrativeText element may have a Title element as a parent (a “sub-title”), which in turn may have another Title element as its parent (a "title"). |
 | `category_depth` | Element depth relative to other elements of the same category. Category depth is the depth of an element relative to other elements of the same category. It’s set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. Category depth may be set using native document hierarchies, e.g. reflecting \<H1>, \<H2>, or \<H3> tags within an HTML document or the indentation level of a bulleted list item in a Word document. |
+| `image_base64` | A Base64 representation of the detected image or table. Only applicable to image and table elements when High Res partitioning is used. After chunking, `image_base64` is not preserved in the output. |
+| `image_mime_type` | MIME type of the image. Only applicable to image elements. |
 | `text_as_html` | HTML representation of extracted tables. Only applicable to table elements. |
 | `languages` | Document Languages. At document level or element level. List is ordered by probability of being the primary language of the text. |
 | `emphasized_text_contents` | Emphasized text (bold or italic) in the original document. |
 | `emphasized_text_tags` | Tags on text that is emphasized in the original document. |
+| `orig_elements` | For chunked elements, a list of the original elements that were used to create the current chunked element. |
 | `is_continuation` | True if element is a continuation of a previous element. Only relevant for chunking, if an element was divided into two due to max_characters. |
 | `detection_class_prob` | Detection model class probabilities. From unstructured-inference, hi-res strategy. |
 

ui/chunking.mdx

Lines changed: 11 additions & 4 deletions
@@ -22,10 +22,6 @@ text element that was too big to fit in one chunk and required splitting.
 - `Table`: A table element is not combined with other elements, and if it fits within the max characters setting it will remain as is.
 - `TableChunk`: Large tables that exceed the max characters setting are split into special `TableChunk` elements.
 
-<Note>
-During chunking, Unstructured removes all detected `Image` elements from the output.
-</Note>
-
 Here are a few examples:
 
 ```json
@@ -95,6 +91,17 @@ Here are a few examples:
 }
 ```
 
+If the option to include original elements is specified, during chunking the `orig_elements` field is added to the `metadata` field of each chunked element.
+The `orig_elements` field is a list of the original elements that were used to create the current chunked element. This list is output in
+compressed Base64 gzipped format. To get back to the original content for this list, Base64-decode the list's bytes, decompress them, and then decode them using UTF-8.
+[Learn how](/api-reference/partition/get-chunked-elements).
+
+After chunking, `Image` elements are not preserved in the output. However,
+if High Res partitioning is used and the option to include original elements is also specified, the `orig_elements` field of each chunked element will contain
+an `image_base64` field for each detected image and table associated with the original elements listed within `orig_elements`. To get back to the
+original content for an `image_base64` field, Base64-decode the field's bytes.
+[Learn how](/api-reference/partition/extract-image-block-types).
+
 The following sections provide information about the available chunking strategies and their settings.
 
 <Note>You can change a workflow's preconfigured strategy only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.</Note>
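
Reviewer note: the second added paragraph says each `image_base64` field can be turned back into the original content by Base64-decoding it. A minimal Python sketch of that step follows, assuming the original elements have already been decoded into a list of dicts that carry `type` and `metadata` keys. The helper name `extract_images` and the sample payloads are hypothetical and stand in for real image data.

```python
import base64


def extract_images(orig_elements):
    """Return (element_type, mime_type, image_bytes) for elements with image_base64."""
    images = []
    for element in orig_elements:
        metadata = element.get("metadata", {})
        payload = metadata.get("image_base64")
        if payload is None:
            # Elements without an image_base64 field (for example, plain text) are skipped.
            continue
        images.append(
            (
                element.get("type"),
                metadata.get("image_mime_type"),
                base64.b64decode(payload),  # raw image bytes
            )
        )
    return images


# Hypothetical decoded orig_elements list; the byte strings are placeholders, not real images.
sample_orig_elements = [
    {"type": "NarrativeText", "text": "Intro", "metadata": {}},
    {
        "type": "Image",
        "text": "",
        "metadata": {
            "image_base64": base64.b64encode(b"placeholder-image").decode("ascii"),
            "image_mime_type": "image/jpeg",
        },
    },
    {
        "type": "Table",
        "text": "a b",
        "metadata": {
            "image_base64": base64.b64encode(b"placeholder-table").decode("ascii"),
        },
    },
]

images = extract_images(sample_orig_elements)
for element_type, mime_type, data in images:
    print(element_type, mime_type, len(data))
```

Returning the bytes rather than writing files keeps the sketch free of any assumption about file names or extensions; `image_mime_type` is passed through so a caller can choose one.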

ui/document-elements.mdx

Lines changed: 2 additions & 0 deletions
@@ -103,9 +103,11 @@ All file types return the following `metadata` fields when the information is av
 | `file_directory` | The related file's directory. |
 | `filename` | The related file's filename. |
 | `filetype` | The related file's type. |
+| `image_base64` | A Base64 representation of the detected image or table. Only applicable to image and table elements when High Res partitioning is used. |
 | `is_continuation` | True if the element is a continuation of a previous element. Only relevant for chunking, if an element was divided into two due to **Max Characters**. |
 | `languages` | Document languages at the file or element level. The list is ordered by probability of being the primary language of the text. |
 | `last_modified` | The related file's last modified date. |
+| `orig_elements` | For [chunked](/ui/chunking) elements, a list of the original elements that were used to create the current chunked element. |
 | `parent_id` | The ID of the element's parent element. `parent_id` might be used to infer where an element resides within the overall [document hierarchy](#document-hierarchy). For instance, a `NarrativeText` element might have a `Title` element as a parent (a “subtitle”), which in turn might have another `Title` element as its parent (a "title"). |
 | `text_as_html` | The HTML representation of the related extracted table. Only applicable to [table elements](#table-specific-metadata). |
 
