Commit 3eaf65a

feat: refactor ingest (#3009)
### Description

This refactors the current ingest CLI process to support finer granularity in how the steps are run.

* Both multiprocessing and async are now supported. Because many of the steps are IO-bound, such as downloading and uploading content, async gives better parallelization here.
* The destination step is broken up into a stager step and an upload step. Steps that manipulate the data between formats, such as converting the elements JSON into CSV for upload to tabular destinations, can now be pulled out of the step that does the actual upload.
* Writing the content to a local destination is now pulled out as its own dedicated destination connector, meaning you no longer need to persist the content locally once the process is done if the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the Python client.
* Uncompress support is moved into a pipeline step, since it can apply to any concrete files that have been downloaded, regardless of where they came from.
* Leverage the last modified date to mark files to be reprocessed, even if the file already exists locally.

### Callouts

Retry configs haven't been moved over yet. This is an open question: the intent was for them to wrap potential connection errors, but now any of the other steps that leverage an API might run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? This would bloat the input params even more.

### Testing

* If you want to run the new code as an SDK, there's an example file that was added to highlight how to do that: [example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:

```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```

* If you want to see which commands have been migrated to the new version, there's now a `v2` short help text next to those commands when running the current CLI:

```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  airtable
  azure
  biomed
  box
  confluence
  delta-table
  discord
  dropbox
  elasticsearch
  fsspec
  gcs
  github
  gitlab
  google-drive
  hubspot
  jira
  local          v2
  mongodb
  notion
  onedrive
  opensearch
  outlook
  reddit
  s3             v2
  salesforce
  sftp
  sharepoint
  slack
  wikipedia
```

You can run any of the local or s3 specific ingest tests and these should now work.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
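The stager/upload split described in the commit message can be pictured with a small standalone sketch. It is not the code from this commit: `stage_elements_as_csv`, `upload`, and the two-column layout are hypothetical names chosen for illustration, and only the Python standard library is used. The point is that format conversion lives in one step while the upload step just ships whatever it is handed.

```python
import csv
import io
import json


def stage_elements_as_csv(elements_json: str) -> str:
    """Hypothetical stager: reshape a JSON list of elements into CSV text."""
    elements = json.loads(elements_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["type", "text"], lineterminator="\n")
    writer.writeheader()
    for element in elements:
        writer.writerow({"type": element["type"], "text": element["text"]})
    return buffer.getvalue()


def upload(payload: str, destination: list) -> None:
    """Hypothetical uploader: knows nothing about formats, only delivers the payload."""
    destination.append(payload)


# A list stands in for a tabular destination in this sketch.
elements_json = json.dumps(
    [
        {"type": "Title", "text": "War and Peace"},
        {"type": "NarrativeText", "text": "Well Prince"},
    ]
)
table = []
upload(stage_elements_as_csv(elements_json), table)
```

Keeping the two functions separate means a destination that accepts the elements JSON directly could skip the stager entirely and reuse the same upload step.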
1 parent 73739b3 commit 3eaf65a

120 files changed: +43791 -34966 lines changed
CHANGELOG.md

Lines changed: 8 additions & 1 deletion
@@ -1,8 +1,15 @@
-## 0.14.1-dev0
+## 0.14.1-dev1

 * **Add support for Python 3.12**. `unstructured` now works with Python 3.12!

 ### Features
+* **Large improvements to the ingest process:**
+* Support for multiprocessing and async, with limits for both.
+* Streamlined to process when mapping CLI invocations to the underlying code
+* More granular steps introduced to give better control over process (i.e. dedicated step to uncompress files already in the local filesystem, new optional staging step before upload)
+* Use the python client when calling the unstructured api for partitioning or chunking
+* Saving the final content is now a dedicated destination connector (local) set as the default if none are provided. Avoids adding new files locally if uploading elsewhere.
+* Leverage last modified date when deciding if new files should be downloaded and reprocessed.

 ### Fixes

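The changelog entry about async support "with limits" can be illustrated with a minimal sketch of capping concurrent IO-bound work. This is an assumption about the general technique, not the implementation in this commit: `download`, `run_step`, and `max_connections` are hypothetical names, and `asyncio.sleep` stands in for a real network call.

```python
import asyncio


async def download(name: str, limiter: asyncio.Semaphore) -> str:
    """Hypothetical IO-bound step; the sleep stands in for a network request."""
    async with limiter:
        await asyncio.sleep(0.01)
        return f"downloaded:{name}"


async def run_step(names: list[str], max_connections: int) -> list[str]:
    # The semaphore caps how many downloads are in flight at the same time.
    limiter = asyncio.Semaphore(max_connections)
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(download(name, limiter) for name in names))


results = asyncio.run(run_step(["a.txt", "b.txt", "c.txt"], max_connections=2))
```

A semaphore is only one way to bound concurrency; a fixed pool of worker tasks reading from a queue would give a similar limit.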

examples/ingest/chroma/ingest.sh

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
  --input-path example-docs/book-war-and-peace-1p.txt \
  --output-dir local-to-chroma \
  --strategy fast \
- --chunk-elements \
+ --chunking-strategy by_title \
  --embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
  --num-processes 2 \
  --verbose \

examples/ingest/clarifai/ingest.sh

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-clarifai \
  --strategy fast \
- --chunk-elements \
+ --chunking-strategy by_title \
  --num-processes 2 \
  --verbose \
  clarifai \

examples/ingest/elasticsearch/destination.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-to-elasticsearch \
  --strategy fast \
- --chunk-elements \
+ --chunking-strategy by_title \
  --embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
  --num-processes 2 \
  --verbose \

examples/ingest/mongodb/destination.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-to-mongodb \
  --strategy fast \
- --chunk-elements \
+ --chunking-strategy by_title \
  --embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
  --num-processes 2 \
  --verbose \

examples/ingest/opensearch/destination.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-to-opensearch \
  --strategy fast \
- --chunk-elements \
+ --chunking-strategy by_title \
  --embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
  --num-processes 2 \
  --verbose \

examples/ingest/pinecone/ingest.sh

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-to-pinecone \
  --strategy fast \
- --chunk-elements \
+ --chunking-strategy by_title \
  --embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
  --num-processes 2 \
  --verbose \

examples/ingest/qdrant/ingest.sh

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ unstructured-ingest \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-qdrant \
  --strategy fast \
- --chunk-elements \
+ --chunking-strategy by_title \
  --embedding-provider "$EMBEDDING_PROVIDER" \
  --num-processes 2 \
  --verbose \

examples/ingest/sql/ingest.sh

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-to-pinecone \
  --strategy fast \
- --chunk-elements \
+ --chunking-strategy by_title \
  --embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
  --num-processes 2 \
  --verbose \

examples/ingest/weaviate/ingest.sh

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
  --reprocess \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --work-dir weaviate-work-dir \
- --chunk-elements \
+ --chunking-strategy by_title \
  --chunk-new-after-n-chars 2500 --chunk-multipage-sections \
  --embedding-provider "langchain-huggingface" \
  weaviate \
