
Commit 56fbaae

feat(chunking): add metadata.orig_elements serde (#2680)
**Summary**

This final PR in the "orig_elements" series adds serialization support so that `.metadata.orig_elements`, when present on a chunk (element), is serialized to JSON whenever the chunk is serialized, for instance for use in an HTTP response payload. It also provides for deserializing such a JSON payload into chunks that contain the `.orig_elements` metadata.

**Additional Context**

Note that `.metadata.orig_elements` is always `Optional[list[Element]]` when in memory. However, those original elements are serialized as Base64-encoded gzipped JSON and are in that form (`str`) when present as JSON or as "element-dicts", which is an intermediate serialization/deserialization format. That is, serialization is `Element -> dict -> JSON`, deserialization is `JSON -> dict -> Element`, and `.orig_elements` is Base64-encoded in both the `dict` and `JSON` forms.

---------

Co-authored-by: scanny <[email protected]>
1 parent fd8b682 commit 56fbaae
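
The encoding described in the commit message can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names `encode_element_dicts` and `decode_element_dicts` are hypothetical, plain dicts stand in for element-dicts, and the choice of `zlib` is an inference from the test fixtures in this commit, whose Base64 text begins with `eJ`, the usual signature of a zlib stream.

```python
import base64
import json
import zlib


def encode_element_dicts(element_dicts):
    """Compress a list of JSON-serializable dicts into a Base64 string.

    Hypothetical helper for illustration only.
    """
    json_bytes = json.dumps(element_dicts, sort_keys=True).encode("utf-8")
    # Assumption: zlib (DEFLATE with a zlib header) is the compression used.
    compressed = zlib.compress(json_bytes)
    return base64.b64encode(compressed).decode("utf-8")


def decode_element_dicts(b64_str):
    """Reverse `encode_element_dicts`: Base64 -> bytes -> JSON -> dicts."""
    compressed = base64.b64decode(b64_str)
    return json.loads(zlib.decompress(compressed).decode("utf-8"))


if __name__ == "__main__":
    dicts = [
        {"type": "Title", "text": "Lorem"},
        {"type": "Text", "text": "Lorem Ipsum"},
    ]
    encoded = encode_element_dicts(dicts)
    # The encoded form is a plain str, so it can sit in a dict or JSON field.
    print(type(encoded).__name__, decode_element_dicts(encoded) == dicts)
```

Because the encoded value is an ordinary string, it passes unchanged through both the `dict` and `JSON` stages, which is why only the final `dict -> Element` step needs to decode it.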

9 files changed

+265
-69
lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44

55
* **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
66
* **Add `composite_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element-level row and column index and content accuracy scores.
7+
* **Add `.metadata.orig_elements` to chunks.** `.metadata.orig_elements: list[Element]` is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful, for example, to recover metadata fields that cannot be consolidated to a single value for a chunk, like `page_number`, `coordinates`, and `image_base64`.
78

89
### Features
910

docs/source/apis/api_parameters.rst

Lines changed: 57 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,12 @@ encoding
3939
- **Description**: The encoding method used to decode the text input. Default: utf-8.
4040
- **Example**: utf-8
4141

42+
extract_image_block_types
43+
-------------------------
44+
- **Type**: array
45+
- **Description**: The types of image blocks to extract from the document. Supports various Element types.
46+
- **Example**: ['Image', 'Table']
47+
4248
hi_res_model_name
4349
-----------------
4450
- **Type**: string
@@ -48,7 +54,8 @@ hi_res_model_name
4854
include_page_breaks
4955
-------------------
5056
- **Type**: boolean
51-
- **Description**: If True, the output will include page breaks if the filetype supports it. Default: false.
57+
- **Description**: When true, the output will include page break elements when the filetype supports
58+
it. Default: false.
5259

5360
languages
5461
---------
@@ -72,37 +79,66 @@ xml_keep_tags
7279
- **Type**: boolean
7380
- **Description**: If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to partition_xml.
7481

82+
83+
Chunking Parameters
84+
-------------------
85+
86+
The following parameters control chunking behavior. Chunking is automatically performed after
87+
partitioning when a value is provided for the ``chunking_strategy`` argument. The remaining chunking
88+
parameters are only operative when a chunking strategy is specified. Note that not all chunking
89+
parameters apply to all chunking strategies. Any chunking arguments not supported by the selected
90+
chunker are ignored.
91+
7592
chunking_strategy
7693
-----------------
7794
- **Type**: string
78-
- **Description**: Use one of the supported strategies to chunk the returned elements. Currently supports: by_title.
79-
- **Example**: by_title
80-
81-
multipage_sections
82-
------------------
83-
- **Type**: boolean
84-
- **Description**: If chunking strategy is set, determines if sections can span multiple sections. Default: true.
95+
- **Description**: Use one of the supported strategies to chunk the returned elements. When omitted,
96+
no chunking is performed and any other chunking parameters provided are ignored.
97+
- **Valid values**: ``"basic"``, ``"by_title"``
8598

8699
combine_under_n_chars
87100
---------------------
88101
- **Type**: integer
89-
- **Description**: If chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500.
102+
- **Applicable Chunkers**: "by_title" only
103+
- **Description**: When chunking strategy is set to "by_title", combine small chunks until the
104+
combined chunk reaches a length of n chars. This can mitigate the appearance of small chunks
105+
created by short paragraphs, not intended as section headings, being identified as ``Title``
106+
elements in certain documents.
107+
- **Default**: the same value as ``max_characters``
90108
- **Example**: 500
91109

92-
new_after_n_chars
93-
-----------------
94-
- **Type**: integer
95-
- **Description**: If chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500.
96-
- **Example**: 1500
110+
include_orig_elements
111+
---------------------
112+
- **Type**: boolean
113+
- **Applicable Chunkers**: All
114+
- **Description**: Add the elements used to form each chunk to ``.metadata.orig_elements`` for that
115+
chunk. These can be used to recover the original text and metadata for individual elements when
116+
that is required, for example to identify the page-numbers or coordinates spanned by a chunk.
117+
When an element larger than ``max_characters`` is divided into two or more chunks via
118+
text-splitting, each of those chunks will contain the entire original element as the only item in
119+
its ``.metadata.orig_elements`` list.
120+
- **Default**: true
97121

98122
max_characters
99123
--------------
100124
- **Type**: integer
101-
- **Description**: If chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 1500.
102-
- **Example**: 1500
125+
- **Applicable Chunkers**: All
126+
- **Description**: When chunking strategy is set, cut off new chunks after reaching a length of n
127+
chars (hard max).
128+
- **Default**: 500
103129

104-
extract_image_block_types
105-
-------------------------
106-
- **Type**: array
107-
- **Description**: The types of image blocks to extract from the document. Supports various Element types.
108-
- **Example**: ['Image', 'Table']
130+
multipage_sections
131+
------------------
132+
- **Type**: boolean
133+
- **Applicable Chunkers**: "by_title" only
134+
- **Description**: When true and chunking strategy is set to "by_title", allows a chunk to include
135+
elements from more than one page. Otherwise chunks are broken on page boundaries.
136+
- **Default**: true
137+
138+
new_after_n_chars
139+
-----------------
140+
- **Type**: integer
141+
- **Applicable Chunkers**: "basic", "by_title"
142+
- **Description**: When chunking strategy is set, start a new chunk after reaching a length of n
143+
chars (soft max).
144+
- **Default**: 1500
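
The difference between the soft maximum (``new_after_n_chars``) and the hard maximum (``max_characters``) can be illustrated with a toy packer. This is only a sketch under stated assumptions and is not the chunking algorithm used by the library: the function `pack_chunks` is hypothetical, it operates on plain strings rather than elements, it joins texts with a blank line, and it splits oversized text at a fixed character offset.

```python
def pack_chunks(texts, max_characters=500, new_after_n_chars=None):
    """Toy packer showing soft-max versus hard-max behavior.

    Hypothetical illustration; not the library's chunker.
    """
    # Assumption: when no soft max is given, it equals the hard max.
    soft_max = max_characters if new_after_n_chars is None else min(
        new_after_n_chars, max_characters
    )
    chunks = []
    current = ""
    for text in texts:
        candidate = text if not current else current + "\n\n" + text
        # Close the current chunk if adding this text would break the hard
        # max, or if the current chunk has already reached the soft max.
        if current and (len(candidate) > max_characters or len(current) >= soft_max):
            chunks.append(current)
            candidate = text
        # Hard max: a single oversized text is split by character count.
        while len(candidate) > max_characters:
            chunks.append(candidate[:max_characters])
            candidate = candidate[max_characters:]
        current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    texts = ["a" * 40, "b" * 40, "c" * 40]
    # Only the hard max applies: the first two texts share a chunk.
    print([len(c) for c in pack_chunks(texts, max_characters=100)])
    # A soft max of 30 closes each chunk as soon as it holds one text.
    print([len(c) for c in pack_chunks(texts, max_characters=100, new_after_n_chars=30)])
```

In this sketch no chunk ever exceeds the hard maximum, while the soft maximum only decides when to stop adding further text to a chunk that is already large enough.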

docs/source/core/chunking.rst

Lines changed: 44 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -65,8 +65,8 @@ be specified when a non-default setting is required. Specific chunking strategie
6565
need to decide based on your use-case whether this option is right for you.
6666

6767

68-
Chunking elements
69-
-----------------
68+
Chunking
69+
--------
7070

7171
Chunking can be performed as part of partitioning or as a separate step after
7272
partitioning:
@@ -170,3 +170,45 @@ following behaviors:
170170
``combine_text_under_n_chars`` argument. This defaults to the same value as ``max_characters``
171171
such that sequential small sections are combined to maximally fill the chunking window. Setting
172172
this to ``0`` will disable section combining.
173+
174+
175+
Recovering Chunk Elements
176+
-------------------------
177+
178+
In general, a chunk consolidates multiple document elements to maximally fill a chunk of the desired
179+
size. Information is naturally lost in this consolidation, for example which element a portion of
180+
the text came from and certain metadata like page-number and coordinates which cannot always be
181+
resolved to a single value.
182+
183+
The original elements combined to make a chunk can be accessed using the ``.metadata.orig_elements``
184+
field on the chunk:
185+
186+
.. code:: python
187+
188+
>>> elements = [
189+
... Title("Lorem Ipsum"),
190+
... NarrativeText("Lorem ipsum dolor sit."),
191+
... ]
192+
>>> chunk = chunk_elements(elements)[0]
193+
>>> chunk.text
194+
'Lorem Ipsum\n\nLorem ipsum dolor sit.'
195+
>>> print(chunk.metadata.orig_elements)
196+
[Title("Lorem Ipsum"), NarrativeText("Lorem ipsum dolor sit.")]
197+
198+
These elements retain all their original metadata, so they can be used to access metadata that
199+
cannot reliably be consolidated, for example:
200+
201+
.. code:: python
202+
203+
>>> {e.metadata.page_number for e in chunk.metadata.orig_elements}
204+
{2, 3}
205+
206+
>>> [e.metadata.coordinates for e in chunk.metadata.orig_elements]
207+
[<CoordinatesMetadata ...>, <CoordinatesMetadata ...>, ...]
208+
209+
>>> [
210+
...     e.metadata.image_path
211+
...     for e in chunk.metadata.orig_elements
212+
...     if e.metadata.image_path is not None
213+
... ]
214+
['/tmp/lorem.jpg', '/tmp/ipsum.png']
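
As a further illustration, the page range covered by a chunk can be derived from its original elements. The sketch below is self-contained and uses `SimpleNamespace` stand-ins rather than real `unstructured` elements; `make_element`, `make_chunk` and `page_span` are hypothetical names introduced only for this example. It assumes, as stated in the commit message, that `.metadata.orig_elements` may be `None`.

```python
from types import SimpleNamespace


def make_element(text, page_number=None):
    """Hypothetical stand-in for an element with `.metadata.page_number`."""
    return SimpleNamespace(
        text=text, metadata=SimpleNamespace(page_number=page_number)
    )


def make_chunk(orig_elements):
    """Hypothetical stand-in for a chunk with `.metadata.orig_elements`."""
    return SimpleNamespace(metadata=SimpleNamespace(orig_elements=orig_elements))


def page_span(chunk):
    """Return (first_page, last_page) for a chunk, or None if unknown."""
    # `orig_elements` is optional, so treat a missing value as an empty list.
    orig_elements = chunk.metadata.orig_elements or []
    pages = sorted(
        {
            e.metadata.page_number
            for e in orig_elements
            if e.metadata.page_number is not None
        }
    )
    return (pages[0], pages[-1]) if pages else None


if __name__ == "__main__":
    chunk = make_chunk(
        [
            make_element("Lorem Ipsum", page_number=2),
            make_element("Lorem ipsum dolor sit.", page_number=3),
        ]
    )
    print(page_span(chunk))
```

Returning `None` when no page number is available keeps callers from mistaking a chunk without page metadata for one that sits on a single page.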

docs/source/metadata.rst

Lines changed: 31 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -7,9 +7,9 @@ Metadata
77
========
88

99
The ``unstructured`` package tracks a variety of metadata about Elements extracted from documents.
10-
Tracking metadata enables users to filter document elements downstream based on element metadata of interest.
11-
For example, a user may be interested in selected document elements from a given page number
12-
or an e-mail with a given subject line.
10+
Element metadata has a variety of uses including:
11+
* filtering document elements based on an element metadata value, for example, elements from a given page number or an e-mail with a subject matching a regular expression.
12+
* mapping an element to the document page where it occurred so that the original page can be retrieved when that element matches search criteria.
1313

1414
Metadata is tracked at the element level. You can extract the metadata for a given document element
1515
with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``.
@@ -136,34 +136,34 @@ returned. If the ``in_place`` flag is ``False``, only the altered coordinates ar
136136
Additional Metadata Fields by Document Type
137137
###########################################
138138

139-
+-------------------------+---------------------+--------------------------------------------------------+
140-
| Field Name | Applicable Doc Types| Short Description |
141-
+=========================+=====================+========================================================+
142-
| page_number | DOCX,PDF, PPT,XLSX | Page Number |
143-
+-------------------------+---------------------+--------------------------------------------------------+
144-
| page_name | XLSX | Sheet Name in Excel document |
145-
+-------------------------+---------------------+--------------------------------------------------------+
146-
| sent_from | EML | Email Sender |
147-
+-------------------------+---------------------+--------------------------------------------------------+
148-
| sent_to | EML | Email Recipient |
149-
+-------------------------+---------------------+--------------------------------------------------------+
150-
| subject | EML | Email Subject |
151-
+-------------------------+---------------------+--------------------------------------------------------+
152-
| attached_to_filename | MSG | filename that attachment file is attached to |
153-
+-------------------------+---------------------+--------------------------------------------------------+
154-
| header_footer_type | Word Doc | Pages a header or footer applies to: "primary", |
155-
| | | "even_only", and "first_page" |
156-
+-------------------------+---------------------+--------------------------------------------------------+
157-
| link_urls | HTML | The url associated with a link in a document. |
158-
+-------------------------+---------------------+--------------------------------------------------------+
159-
| link_texts | HTML | The text associated with a link in a document. |
160-
+-------------------------+---------------------+--------------------------------------------------------+
161-
| links | HTML | List of {”text”: “<the text>, “url”: <the url>} items. |
162-
| | | Note: this element will be removed in the near future |
163-
| | | in favor of the above link_urls and link_texts. |
164-
+-------------------------+---------------------+--------------------------------------------------------+
165-
| section | EPUB | Book section title corresponding to table of contents |
166-
+-------------------------+---------------------+--------------------------------------------------------+
139+
+-------------------------+-----------------------+--------------------------------------------------------+
140+
| Field Name | Applicable Doc Types | Short Description |
141+
+=========================+=======================+========================================================+
142+
| page_number | DOCX, PDF, PPT, XLSX | Page Number |
143+
+-------------------------+-----------------------+--------------------------------------------------------+
144+
| page_name | XLSX | Sheet Name in Excel document |
145+
+-------------------------+-----------------------+--------------------------------------------------------+
146+
| sent_from | EML | Email Sender |
147+
+-------------------------+-----------------------+--------------------------------------------------------+
148+
| sent_to | EML | Email Recipient |
149+
+-------------------------+-----------------------+--------------------------------------------------------+
150+
| subject | EML | Email Subject |
151+
+-------------------------+-----------------------+--------------------------------------------------------+
152+
| attached_to_filename | MSG | filename that attachment file is attached to |
153+
+-------------------------+-----------------------+--------------------------------------------------------+
154+
| header_footer_type | Word Doc | Pages a header or footer applies to: "primary", |
155+
| | | "even_only", and "first_page" |
156+
+-------------------------+-----------------------+--------------------------------------------------------+
157+
| link_urls | HTML | The url associated with a link in a document. |
158+
+-------------------------+-----------------------+--------------------------------------------------------+
159+
| link_texts | HTML | The text associated with a link in a document. |
160+
+-------------------------+-----------------------+--------------------------------------------------------+
161+
| links | HTML | List of {”text”: “<the text>, “url”: <the url>} items. |
162+
| | | Note: this element will be removed in the near future |
163+
| | | in favor of the above link_urls and link_texts. |
164+
+-------------------------+-----------------------+--------------------------------------------------------+
165+
| section | EPUB | Book section title corresponding to table of contents |
166+
+-------------------------+-----------------------+--------------------------------------------------------+
167167

168168
:raw-html:`<br />`
169169
Notes on additional metadata by document type:

test_unstructured/documents/test_elements.py

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,7 @@
2727
Points,
2828
RegexMetadata,
2929
Text,
30+
Title,
3031
)
3132

3233

@@ -381,6 +382,22 @@ def and_it_serializes_a_data_source_sub_object_to_a_dict_when_it_is_present(self
381382
"page_number": 2,
382383
}
383384

385+
def and_it_serializes_an_orig_elements_sub_object_to_base64_when_it_is_present(self):
386+
meta = ElementMetadata(
387+
category_depth=1,
388+
orig_elements=[Title("Lorem"), Text("Lorem Ipsum")],
389+
page_number=2,
390+
)
391+
assert meta.to_dict() == {
392+
"category_depth": 1,
393+
"orig_elements": (
394+
"eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWt"
395+
"JVm5WDoqNUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQp"
396+
"ZRxYZcCA/1spDtP98dU6DTEw3sa5fWOqs10vH0cLQn0="
397+
),
398+
"page_number": 2,
399+
}
400+
384401
def but_unlike_in_ElementMetadata_unknown_fields_in_sub_objects_are_ignored(self):
385402
"""Metadata sub-objects ignore fields they do not explicitly define.
386403

test_unstructured/staging/test_base.py

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,28 @@
3131
from unstructured.staging import base
3232

3333

34+
def test_base64_gzipped_json_to_elements_can_deserialize_compressed_elements_from_a_JSON_string():
35+
base64_elements_str = (
36+
"eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWtJVm5WDoq"
37+
"NUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQpZRxYZcCA/1spDtP9"
38+
"8dU6DTEw3sa5fWOqs10vH0cLQn0="
39+
)
40+
41+
elements = base.elements_from_base64_gzipped_json(base64_elements_str)
42+
43+
assert elements == [Title("Lorem"), Text("Lorem Ipsum")]
44+
45+
46+
def test_elements_to_base64_gzipped_json_can_serialize_elements_to_a_base64_str():
47+
elements = [Title("Lorem"), Text("Lorem Ipsum")]
48+
49+
assert base.elements_to_base64_gzipped_json(elements) == (
50+
"eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWtJVm5WDoq"
51+
"NUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQpZRxYZcCA/1spDtP9"
52+
"8dU6DTEw3sa5fWOqs10vH0cLQn0="
53+
)
54+
55+
3456
def test_elements_to_dicts():
3557
elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]
3658
isd = base.elements_to_dicts(elements)
