Commit 3843af6

feat: Enable remote chunking via unstructured-ingest (#2905)
Update: The CLI shell script works when sending documents to the free API, but the paid API is down, so testing against it is still pending.

- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct method is moved from `ChunkingConfig` to `Chunker` so that `ChunkingConfig` acts as a config object instead of containing implementation logic. `Chunker.chunk()` also now takes a JSON file instead of a list of elements. This avoids redundant serialization if the file is to be sent to the API for chunking.

---------

Co-authored-by: Ahmet Melek <[email protected]>
1 parent 2d1923a commit 3843af6
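
The refactor described in the commit message (dispatch moving from `ChunkingConfig` to `Chunker`, with `Chunker.chunk()` receiving a JSON file path rather than a list of elements) might look roughly like the sketch below. Everything in it is illustrative: the field names, the `_chunk_locally` and `_chunk_remotely` helpers, and the naive character splitting are assumptions, not the actual unstructured-ingest code, and the remote branch makes no network call.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class ChunkingConfig:
    """Plain settings holder; no chunking logic lives here (hypothetical fields)."""

    chunking_strategy: Optional[str] = None
    chunk_by_api: bool = False
    max_characters: int = 500


class Chunker:
    """Hypothetical chunker that owns the dispatch logic."""

    def __init__(self, config: ChunkingConfig) -> None:
        self.config = config

    def chunk(self, elements_filepath: str) -> List[Dict[str, Any]]:
        # No strategy configured: return the partitioned elements unchanged.
        if self.config.chunking_strategy is None:
            return json.loads(Path(elements_filepath).read_text())
        if self.config.chunk_by_api:
            return self._chunk_remotely(elements_filepath)
        return self._chunk_locally(elements_filepath)

    def _chunk_locally(self, elements_filepath: str) -> List[Dict[str, Any]]:
        # Stand-in for a local strategy: split each element's text into
        # windows of at most max_characters.
        elements = json.loads(Path(elements_filepath).read_text())
        size = self.config.max_characters
        chunks: List[Dict[str, Any]] = []
        for element in elements:
            text = element.get("text", "")
            for start in range(0, len(text), size):
                chunks.append({"type": "CompositeElement", "text": text[start:start + size]})
        return chunks

    def _chunk_remotely(self, elements_filepath: str) -> List[Dict[str, Any]]:
        # Stand-in for an API call: the file's bytes are read as-is, so the
        # elements never need to be re-serialized before upload.
        payload = Path(elements_filepath).read_bytes()
        return [{"type": "RemotePlaceholder", "bytes_sent": len(payload)}]
```

Passing a file path keeps the remote branch free of any deserialize-then-reserialize round trip, which is the motivation the commit message gives for the signature change.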

26 files changed: 429 additions and 152 deletions

CHANGELOG.md

Lines changed: 8 additions & 2 deletions
@@ -1,7 +1,13 @@
-## 0.13.4-dev1
+## 0.13.4-dev2
 
 ### Enhancements
-* **Unique and deterministic hash IDs for elements** Element IDs produced by any partitioning function are now deterministic and unique at the document level by default. Before, hashes were based only on text; however, they now also take into account the element's sequence number on a page, the page's number in the document, and the document's file name.
+* **Unique and deterministic hash IDs for elements** Element IDs produced by any partitioning
+function are now deterministic and unique at the document level by default. Before, hashes were
+based only on text; however, they now also take into account the element's sequence number on a
+page, the page's number in the document, and the document's file name.
+* **Enable remote chunking via unstructured-ingest** Chunking using unstructured-ingest was
+previously limited to local chunking using the strategies `basic` and `by_title`. Remote chunking
+options via the API are now accessible.
 
 ### Features
 

docs/source/ingest/source_connectors.rst

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ Source Connectors
 =================
 
 Connect to your favorite data storage platforms for effortless batch processing of your files.
-We are constantly adding new data connectors and if you don'table see your favorite platform let us know
+We are constantly adding new data connectors and if you don't see your favorite platform let us know
 in our community `Slack. <https://short.unstructured.io/pzw05l7>`_
 
 .. toctree::

test_unstructured_ingest/src/against-api.sh

Lines changed: 2 additions & 0 deletions
@@ -34,6 +34,8 @@ PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
 --metadata-exclude coordinates,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
 --partition-by-api \
 --strategy hi_res \
+--chunking-strategy by_page \
+--chunk-max-characters 10000 \
 --pdf-infer-table-structure \
 --reprocess \
 --output-dir "$OUTPUT_DIR" \
File renamed without changes.

test_unstructured_ingest/unit/test_connector_gcs.py renamed to test_unstructured_ingest/unit/connector/fsspec/test_connector_gcs.py

File renamed without changes.
File renamed without changes.
File renamed without changes.

test_unstructured_ingest/unit/test_connector_git.py renamed to test_unstructured_ingest/unit/connector/test_connector_git.py

File renamed without changes.

test_unstructured_ingest/unit/test_salesforce_connector.py renamed to test_unstructured_ingest/unit/connector/test_salesforce_connector.py

File renamed without changes.

test_unstructured_ingest/unit/test_serialization.py renamed to test_unstructured_ingest/unit/connector/test_serialization.py

File renamed without changes.
