feat(chunking): add metadata.orig_elements serde (#2680)

scanny · web-flow · commit 56fbaaed1098 · 2024-03-22T21:53:26.000Z
**Summary**
This final PR in the "orig_elements" series adds serialization support so
that `.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON whenever the chunk is serialized, for instance for use
in an HTTP response payload.

It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.

**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
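
As a rough illustration of the round trip described above, here is a minimal
standard-library sketch. The helper names `encode_orig_elements` and
`decode_orig_elements` are hypothetical and are not the functions added by
this PR, plain dicts stand in for `Element` objects, and the use of `gzip`
is an assumption; the library's actual compression call may differ.

```python
import base64
import gzip
import json


def encode_orig_elements(element_dicts):
    """Hypothetical helper: element-dicts -> JSON -> gzip -> Base64 str."""
    json_bytes = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(json_bytes)).decode("utf-8")


def decode_orig_elements(b64_str):
    """Hypothetical helper: Base64 str -> gzip -> JSON -> element-dicts."""
    json_bytes = gzip.decompress(base64.b64decode(b64_str))
    return json.loads(json_bytes.decode("utf-8"))


# Plain dicts stand in for the original elements (an assumption for brevity).
orig = [
    {"type": "Title", "text": "Lorem"},
    {"type": "Text", "text": "Lorem Ipsum"},
]

# In the intermediate "element-dict" form, orig_elements is already a str.
chunk_dict = {
    "type": "CompositeElement",
    "text": "Lorem\n\nLorem Ipsum",
    "metadata": {"orig_elements": encode_orig_elements(orig)},
}

payload = json.dumps(chunk_dict)  # dict -> JSON
restored = json.loads(payload)    # JSON -> dict

# The field stays a Base64 str in both the dict and JSON forms and is only
# expanded back into a list at the final dict -> Element step.
assert isinstance(restored["metadata"]["orig_elements"], str)
assert decode_orig_elements(restored["metadata"]["orig_elements"]) == orig
```

Keeping the field as an opaque string in the dict and JSON forms means the
nested elements need no special handling by JSON encoders or HTTP clients.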

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,7 @@
 
 * **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
 * **Add `compound_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element level row and column index and content accuracy scores
+* **Add `.metadata.orig_elements` to chunks.** `.metadata.orig_elements: list[Element]` is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful for example to recover metadata fields that cannot be consolidated to a single value for a chunk, like `page_number`, `coordinates`, and `image_base64`.
 
 ### Features
 
diff --git a/docs/source/apis/api_parameters.rst b/docs/source/apis/api_parameters.rst
@@ -39,6 +39,12 @@ encoding
 - **Description**: The encoding method used to decode the text input. Default: utf-8.
 - **Example**: utf-8
 
+extract_image_block_types
+-------------------------
+- **Type**: array
+- **Description**: The types of image blocks to extract from the document. Supports various Element types.
+- **Example**: ['Image', 'Table']
+
 hi_res_model_name
 -----------------
 - **Type**: string
@@ -48,7 +54,8 @@ hi_res_model_name
 include_page_breaks
 -------------------
 - **Type**: boolean
-- **Description**: If True, the output will include page breaks if the filetype supports it. Default: false.
+- **Description**: When true, the output will include page break elements when the filetype supports
+  it. Default: false.
 
 languages
 ---------
@@ -72,37 +79,66 @@ xml_keep_tags
 - **Type**: boolean
 - **Description**: If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to partition_xml.
 
+
+Chunking Parameters
+-------------------
+
+The following parameters control chunking behavior. Chunking is automatically performed after
+partitioning when a value is provided for the ``chunking_strategy`` argument. The remaining chunking
+parameters are only operative when a chunking strategy is specified. Note that not all chunking
+parameters apply to all chunking strategies. Any chunking arguments not supported by the selected
+chunker are ignored.
+
 chunking_strategy
 -----------------
 - **Type**: string
-- **Description**: Use one of the supported strategies to chunk the returned elements. Currently supports: by_title.
-- **Example**: by_title
-
-multipage_sections
-------------------
-- **Type**: boolean
-- **Description**: If chunking strategy is set, determines if sections can span multiple sections. Default: true.
+- **Description**: Use one of the supported strategies to chunk the returned elements. When omitted,
+  no chunking is performed and any other chunking parameters provided are ignored.
+- **Valid values**: ``"basic"``, ``"by_title"``
 
 combine_under_n_chars
 ---------------------
 - **Type**: integer
-- **Description**: If chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500.
+- **Applicable Chunkers**: "by_title" only
+- **Description**: When chunking strategy is set to "by_title", combine small chunks until the
+  combined chunk reaches a length of n chars. This can mitigate the appearance of small chunks
+  created by short paragraphs, not intended as section headings, being identified as ``Title``
+  elements in certain documents.
+- **Default**: the same value as ``max_characters``
 - **Example**: 500
 
-new_after_n_chars
------------------
-- **Type**: integer
-- **Description**: If chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500.
-- **Example**: 1500
+include_orig_elements
+---------------------
+- **Type**: boolean
+- **Applicable Chunkers**: All
+- **Description**: Add the elements used to form each chunk to ``.metadata.orig_elements`` for that
+  chunk. These can be used to recover the original text and metadata for individual elements when
+  that is required, for example to identify the page-numbers or coordinates spanned by a chunk.
+  When an element larger than ``max_characters`` is divided into two or more chunks via
+  text-splitting, each of those chunks will contain the entire original element as the only item in
+  its ``.metadata.orig_elements`` list.
+- **Default**: true
 
 max_characters
 --------------
 - **Type**: integer
-- **Description**: If chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 1500.
-- **Example**: 1500
+- **Applicable Chunkers**: All
+- **Description**: When chunking strategy is set, cut off new chunks after reaching a length of n
+  chars (hard max).
+- **Default**: 500
 
-extract_image_block_types
--------------------------
-- **Type**: array
-- **Description**: The types of image blocks to extract from the document. Supports various Element types.
-- **Example**: ['Image', 'Table']
+multipage_sections
+------------------
+- **Type**: boolean
+- **Applicable Chunkers**: "by_title" only
+- **Description**: When true and chunking strategy is set to "by_title", allows a chunk to include
+  elements from more than one page. Otherwise chunks are broken on page boundaries.
+- **Default**: true
+
+new_after_n_chars
+-----------------
+- **Type**: integer
+- **Applicable Chunkers**: "basic", "by_title"
+- **Description**: When chunking strategy is set, cut off new chunk after reaching a length of n
+  chars (soft max).
+- **Default**: 1500
diff --git a/docs/source/core/chunking.rst b/docs/source/core/chunking.rst
@@ -65,8 +65,8 @@ be specified when a non-default setting is required. Specific chunking strategie
   need to decide based on your use-case whether this option is right for you.
 
 
-Chunking elements
------------------
+Chunking
+--------
 
 Chunking can be performed as part of partitioning or as a separate step after
 partitioning:
@@ -170,3 +170,45 @@ following behaviors:
   ``combine_text_under_n_chars`` argument. This defaults to the same value as ``max_characters``
   such that sequential small sections are combined to maximally fill the chunking window. Setting
   this to ``0`` will disable section combining.
+
+
+Recovering Chunk Elements
+-------------------------
+
+In general, a chunk consolidates multiple document elements to maximally fill a chunk of the desired
+size. Information is naturally lost in this consolidation, for example which element a portion of
+the text came from and certain metadata like page-number and coordinates which cannot always be
+resolved to a single value.
+
+The original elements combined to make a chunk can be accessed using the ``.metadata.orig_elements``
+field on the chunk:
+
+.. code:: python
+
+    >>> elements = [
+    ...     Title("Lorem Ipsum"),
+    ...     NarrativeText("Lorem ipsum dolor sit."),
+    ... ]
+    >>> chunk = chunk_elements(elements)[0]
+    >>> chunk.text
+    'Lorem Ipsum\n\nLorem ipsum dolor sit.'
+    >>> print(chunk.metadata.orig_elements)
+    [Title("Lorem Ipsum"), NarrativeText("Lorem ipsum dolor sit.")]
+
+These elements will contain all their original metadata so can be used to access metadata that
+cannot reliably be consolidated, for example:
+
+.. code:: python
+
+    >>> {e.metadata.page_number for e in chunk.metadata.orig_elements}
+    {2, 3}
+
+    >>> [e.metadata.coordinates for e in chunk.metadata.orig_elements]
+    [<CoordinatesMetadata ...>, <CoordinatesMetadata ...>, ...]
+
+    >>> [
+    ...     e.metadata.image_path
+    ...     for e in chunk.metadata.orig_elements
+    ...     if e.metadata.image_path is not None
+    ... ]
+    ['/tmp/lorem.jpg', '/tmp/ipsum.png']
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
@@ -7,9 +7,9 @@ Metadata
 ========
 
 The ``unstructured`` package tracks a variety of metadata about Elements extracted from documents.
-Tracking metadata enables users to filter document elements downstream based on element metadata of interest.
-For example, a user may be interested in selected document elements from a given page number
-or an e-mail with a given subject line.
+Element metadata has a variety of uses including:
+* filtering document elements based on an element metadata value, for example, elements from a given page number or an e-mail with a subject matching a regular expression.
+* mapping an element to the document page where it occurred so that original page can be retrieved when that element matches search criteria.
 
 Metadata is tracked at the element level. You can extract the metadata for a given document element
 with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``.
@@ -136,34 +136,34 @@ returned. If the ``in_place`` flag is ``False``, only the altered coordinates ar
 Additional Metadata Fields by Document Type
 ###########################################
 
-+-------------------------+---------------------+--------------------------------------------------------+
-| Field Name              | Applicable Doc Types| Short Description                                      |
-+=========================+=====================+========================================================+
-| page_number             | DOCX,PDF, PPT,XLSX  | Page Number                                            |
-+-------------------------+---------------------+--------------------------------------------------------+
-| page_name               | XLSX                | Sheet Name in Excel document                           |
-+-------------------------+---------------------+--------------------------------------------------------+
-| sent_from               | EML                 | Email Sender                                           |
-+-------------------------+---------------------+--------------------------------------------------------+
-| sent_to                 | EML                 | Email Recipient                                        |
-+-------------------------+---------------------+--------------------------------------------------------+
-| subject                 | EML                 | Email Subject                                          |
-+-------------------------+---------------------+--------------------------------------------------------+
-| attached_to_filename    | MSG                 | filename that attachment file is attached to           |
-+-------------------------+---------------------+--------------------------------------------------------+
-| header_footer_type      | Word Doc            | Pages a header or footer applies to: "primary",        |
-|                         |                     | "even_only", and "first_page"                          |
-+-------------------------+---------------------+--------------------------------------------------------+
-| link_urls               | HTML                | The url associated with a link in a document.          |
-+-------------------------+---------------------+--------------------------------------------------------+
-| link_texts              | HTML                | The text associated with a link in a document.         |
-+-------------------------+---------------------+--------------------------------------------------------+
-| links                   | HTML                | List of {”text”: “<the text>, “url”: <the url>} items. |
-|                         |                     | Note: this element will be removed in the near future  |
-|                         |                     | in favor of the above link_urls and link_texts.        |
-+-------------------------+---------------------+--------------------------------------------------------+
-| section                 | EPUB                | Book section title corresponding to table of contents  |
-+-------------------------+---------------------+--------------------------------------------------------+
++-------------------------+-----------------------+--------------------------------------------------------+
+| Field Name              | Applicable Doc Types  | Short Description                                      |
++=========================+=======================+========================================================+
+| page_number             | DOCX, PDF, PPT, XLSX  | Page Number                                            |
++-------------------------+-----------------------+--------------------------------------------------------+
+| page_name               | XLSX                  | Sheet Name in Excel document                           |
++-------------------------+-----------------------+--------------------------------------------------------+
+| sent_from               | EML                   | Email Sender                                           |
++-------------------------+-----------------------+--------------------------------------------------------+
+| sent_to                 | EML                   | Email Recipient                                        |
++-------------------------+-----------------------+--------------------------------------------------------+
+| subject                 | EML                   | Email Subject                                          |
++-------------------------+-----------------------+--------------------------------------------------------+
+| attached_to_filename    | MSG                   | filename that attachment file is attached to           |
++-------------------------+-----------------------+--------------------------------------------------------+
+| header_footer_type      | Word Doc              | Pages a header or footer applies to: "primary",        |
+|                         |                       | "even_only", and "first_page"                          |
++-------------------------+-----------------------+--------------------------------------------------------+
+| link_urls               | HTML                  | The url associated with a link in a document.          |
++-------------------------+-----------------------+--------------------------------------------------------+
+| link_texts              | HTML                  | The text associated with a link in a document.         |
++-------------------------+-----------------------+--------------------------------------------------------+
+| links                   | HTML                  | List of {"text": "<text>", "url": "<url>"} items.      |
+|                         |                       | Note: this element will be removed in the near future  |
+|                         |                       | in favor of the above link_urls and link_texts.        |
++-------------------------+-----------------------+--------------------------------------------------------+
+| section                 | EPUB                  | Book section title corresponding to table of contents  |
++-------------------------+-----------------------+--------------------------------------------------------+
 
 :raw-html:`<br />`
 Notes on additional metadata by document type:
diff --git a/test_unstructured/documents/test_elements.py b/test_unstructured/documents/test_elements.py
@@ -27,6 +27,7 @@
     Points,
     RegexMetadata,
     Text,
+    Title,
 )
 
 
@@ -381,6 +382,22 @@ def and_it_serializes_a_data_source_sub_object_to_a_dict_when_it_is_present(self
             "page_number": 2,
         }
 
+    def and_it_serializes_an_orig_elements_sub_object_to_base64_when_it_is_present(self):
+        meta = ElementMetadata(
+            category_depth=1,
+            orig_elements=[Title("Lorem"), Text("Lorem Ipsum")],
+            page_number=2,
+        )
+        assert meta.to_dict() == {
+            "category_depth": 1,
+            "orig_elements": (
+                "eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWt"
+                "JVm5WDoqNUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQp"
+                "ZRxYZcCA/1spDtP98dU6DTEw3sa5fWOqs10vH0cLQn0="
+            ),
+            "page_number": 2,
+        }
+
     def but_unlike_in_ElementMetadata_unknown_fields_in_sub_objects_are_ignored(self):
         """Metadata sub-objects ignore fields they do not explicitly define.
 
diff --git a/test_unstructured/staging/test_base.py b/test_unstructured/staging/test_base.py
@@ -31,6 +31,28 @@
 from unstructured.staging import base
 
 
+def test_base64_gzipped_json_to_elements_can_deserialize_compressed_elements_from_a_JSON_string():
+    base64_elements_str = (
+        "eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWtJVm5WDoq"
+        "NUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQpZRxYZcCA/1spDtP9"
+        "8dU6DTEw3sa5fWOqs10vH0cLQn0="
+    )
+
+    elements = base.elements_from_base64_gzipped_json(base64_elements_str)
+
+    assert elements == [Title("Lorem"), Text("Lorem Ipsum")]
+
+
+def test_elements_to_base64_gzipped_json_can_serialize_elements_to_a_base64_str():
+    elements = [Title("Lorem"), Text("Lorem Ipsum")]
+
+    assert base.elements_to_base64_gzipped_json(elements) == (
+        "eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWtJVm5WDoq"
+        "NUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQpZRxYZcCA/1spDtP9"
+        "8dU6DTEw3sa5fWOqs10vH0cLQn0="
+    )
+
+
 def test_elements_to_dicts():
     elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]
     isd = base.elements_to_dicts(elements)
diff --git a/test_unstructured_ingest/expected-structured-output/local-single-file-basic-chunking/handbook-1p.docx.json b/test_unstructured_ingest/expected-structured-output/local-single-file-basic-chunking/handbook-1p.docx.json
diff --git a/unstructured/documents/elements.py b/unstructured/documents/elements.py
diff --git a/unstructured/staging/base.py b/unstructured/staging/base.py