Commit 87fd0d0

feat: Ingest refactors, doc updates (#243)
- Creates ABCs for ingest connectors
- Updates the s3_connector classes to inherit from the ABCs
- Moves the s3 test script to its own file to establish a pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions on how to add a connector
- Updates the example s3 ingest script to use the new location for main.py

Note that there were no logic changes; this is essentially a refactoring PR.

Test instructions: Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
1 parent 3149241 commit 87fd0d0

File tree

12 files changed

+391
-226
lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.4.12-dev0
+
+* Adds console_entrypoint for unstructured-ingest
+
 ## 0.4.11
 
 * Adds `partition_doc` for partitioning Word documents in `.doc` format. Requires `libreoffice`.

Ingest.md

Lines changed: 68 additions & 69 deletions
@@ -1,71 +1,70 @@
 # Batch Processing Documents
 
-## Sample Connector: S3
-
-See the sample project [examples/ingest/s3-small-batch/main.py](examples/ingest/s3-small-batch/main.py), which processes all the documents under a given s3 URL with 2 parallel processes, writing the structured json output to `structured-outputs/`.
-
-You can try it out with:
-
-    PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --anonymous
-
-    # Note: the --anonymous flag indicates not to provide AWS credentials, needed
-    # for the boto3 lib. Remove this flag when local AWS credentials are required.
-
-This utility is ready to use with any s3 prefix!
-
-By default, it will not reprocess files from s3 if their outputs already exist in --structured-output-dir. Naturally, this may come in handy when processing a large number of files. However, you can force reprocessing all documents with the --reprocess flag.
-
-
-```
-$ PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --help
-Usage: main.py [OPTIONS]
-
-Options:
-  --s3-url TEXT                   Prefix of s3 objects (files) to download.
-                                  E.g. s3://bucket1/path/. This value may also
-                                  be a single file.
-  --re-download / --no-re-download
-                                  Re-download files from s3 even if they are
-                                  already present in --download-dir.
-  --download-dir TEXT             Where s3 files are downloaded to, defaults
-                                  to tmp-ingest-<6 random chars>.
-  --preserve-downloads            Preserve downloaded s3 files. Otherwise each
-                                  file is removed after being processed
-                                  successfully.
-  --structured-output-dir TEXT    Where to place structured output .json
-                                  files.
-  --reprocess                     Reprocess a downloaded file from s3 even if
-                                  the relevant structured output .json file in
-                                  --structured-output-dir already exists.
-  --num-processes INTEGER         Number of parallel processes to process docs
-                                  in. [default: 2]
-  --anonymous                     Connect to s3 without local AWS credentials.
-  -v, --verbose
-  --help                          Show this message and exit.
-```
-
-# Developer notes
-
-## The Abstractions
-
-```mermaid
-sequenceDiagram
-    participant MainProcess
-    participant DocReader (connector)
-    participant DocProcessor
-    participant StructuredDocWriter (connector)
-    MainProcess->>DocReader (connector): Initialize / Authorize
-    DocReader (connector)->>MainProcess: All doc metadata (no file content)
-    loop Single doc at a time (allows for multiprocessing)
-        MainProcess->>DocProcessor: Raw document metadata (no file content)
-        DocProcessor->>DocReader (connector): Request document
-        DocReader (connector)->>DocProcessor: Single document payload
-        Note over DocProcessor: Process through Unstructured
-        DocProcessor->>StructuredDocWriter (connector): Write Structured Data
-        Note over StructuredDocWriter (connector): <br /> Optionally store version info, filename, etc
-        DocProcessor->>MainProcess: Structured Data (only JSON in V0)
-    end
-    Note over MainProcess: Optional - process structured data from all docs
-```
-
-The abstractions in the above diagram are honored in the S3 Connector project (though ABC's are not yet written), with the exception of the StructuredDocWriter which may be added more formally at a later time.
+## The unstructured-ingest CLI
+
+The unstructured library includes a CLI to batch ingest documents from (soon to be
+various) sources, storing structured outputs locally on the filesystem.
+
+For example, the following command processes all the documents in S3 in the
+`utic-dev-tech-fixtures` bucket with a prefix of `small-pdf-set/`.
+
+    unstructured-ingest \
+      --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
+      --s3-anonymous \
+      --structured-output-dir s3-small-batch-output \
+      --num-processes 2
+
+Naturally, --num-processes may be adjusted for better instance utilization with multiprocessing.
+
+Installation note: make sure to install the following extras when installing unstructured, needed for the above command:
+
+    pip install "unstructured[s3,local-inference]"
+
+# Developers' Guide
+
+## Local testing
+
+When testing from a local checkout rather than a pip-installed version of `unstructured`,
+just execute `unstructured/ingest/main.py`, e.g.:
+
+    PYTHONPATH=. ./unstructured/ingest/main.py \
+      --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
+      --s3-anonymous \
+      --structured-output-dir s3-small-batch-output \
+      --num-processes 2
+
+## Adding Data Connectors
+
+To add a connector, refer to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py) as an example that implements the three relevant abstract base classes.
+
+Then, update [unstructured/ingest/main.py](unstructured/ingest/main.py) to instantiate
+the connector specific to your class if its command line options are invoked.
+
+Create at least one folder under [examples/ingest](examples/ingest) with an easily reproducible
+script that shows the new connector in action.
+
+Finally, to ensure the connector remains stable, add a new script test_unstructured_ingest/test-ingest-\<the-new-data-source\>.sh similar to [test_unstructured_ingest/test-ingest-s3.sh](test_unstructured_ingest/test-ingest-s3.sh), and append a line invoking the new script in [test_unstructured_ingest/test-ingest.sh](test_unstructured_ingest/test-ingest.sh).
+
+You'll notice that the unstructured outputs for the new documents are expected
+to be checked into CI under test_unstructured_ingest/expected-structured-output/\<folder-name-relevant-to-your-dataset\>. So, you'll need to `git add` those json outputs so that `test-ingest.sh` passes in CI.
+
+The `main.py` flags of --re-download/--no-re-download, --download-dir, --preserve-downloads, --structured-output-dir, and --reprocess are honored by the connector.
+
+### The checklist:
+
+In checklist form, the above steps are summarized as:
+
+- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
+- [ ] Update [unstructured/ingest/main.py](unstructured/ingest/main.py) with support for the new connector.
+- [ ] Create a folder under [examples/ingest](examples/ingest) that includes at least one well documented script.
+- [ ] Add a script test_unstructured_ingest/test-ingest-\<the-new-data-source\>.sh. Its JSON output files should total no more than 100K.
+- [ ] Git add the expected outputs under test_unstructured_ingest/expected-structured-output/\<folder-name-relevant-to-your-dataset\> so the above test passes in CI.
+- [ ] Add a line to [test_unstructured_ingest/test-ingest.sh](test_unstructured_ingest/test-ingest.sh) invoking the new test script.
+- [ ] Honors the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py) which is passed through [the CLI](unstructured/ingest/main.py):
+  - [ ] If running with an `.output_dir` where structured outputs already exist for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing the call to `MyIngestDoc.has_output()` which is invoked in [MainProcess._filter_docs_with_outputs](ingest-prep-for-many/unstructured/ingest/main.py).
+  - [ ] Unless `.reprocess` is `True`, in which case documents are always reprocessed.
+  - [ ] If `.preserve_download` is `True`, documents downloaded to `.download_dir` are not removed after processing.
+  - [ ] Else if `.preserve_download` is `False`, documents downloaded to `.download_dir` are removed after they are **successfully** processed during the invocation of `MyIngestDoc.cleanup_file()` in [process_document](unstructured/ingest/doc_processor/generalized.py).
+  - [ ] Does not re-download documents to `.download_dir` if `.re_download` is False, enforced in `MyIngestDoc.get_file()`.
+  - [ ] Prints more details if `.verbose` is set, similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py).
+
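
The config conventions in the checklist above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in `unstructured/ingest/interfaces.py` or `s3_connector.py`: `SketchConnectorConfig`, `remote_name`, and `docs_to_process` are hypothetical names, the "download" is a placeholder file write, and only the method names `has_output()`, `get_file()`, and `cleanup_file()` and the config field names come from the checklist.

```python
import os
from dataclasses import dataclass


@dataclass
class SketchConnectorConfig:
    """Hypothetical stand-in holding the config fields named in the checklist."""
    download_dir: str
    output_dir: str
    re_download: bool = False
    preserve_download: bool = False
    reprocess: bool = False
    verbose: bool = False


@dataclass
class MyIngestDoc:
    """One document from an imaginary data source; `remote_name` is illustrative."""
    config: SketchConnectorConfig
    remote_name: str

    def _local_path(self) -> str:
        return os.path.join(self.config.download_dir, self.remote_name)

    def _output_path(self) -> str:
        return os.path.join(self.config.output_dir, self.remote_name + ".json")

    def has_output(self) -> bool:
        """True if structured output for this document already exists."""
        return os.path.isfile(self._output_path())

    def get_file(self) -> str:
        """Fetch the document unless a local copy exists and re_download is False."""
        path = self._local_path()
        if os.path.isfile(path) and not self.config.re_download:
            if self.config.verbose:
                print(f"File exists: {path}, skipping download")
            return path
        os.makedirs(self.config.download_dir, exist_ok=True)
        with open(path, "w") as f:  # placeholder for a real download call
            f.write("placeholder content")
        return path

    def cleanup_file(self) -> None:
        """Remove the local copy after processing unless preserve_download is set."""
        path = self._local_path()
        if not self.config.preserve_download and os.path.isfile(path):
            os.unlink(path)


def docs_to_process(docs, reprocess: bool):
    """Skip documents that already have output, unless reprocess is requested."""
    if reprocess:
        return list(docs)
    return [doc for doc in docs if not doc.has_output()]
```

Keeping the skip decision in one filter function, separate from the per-document methods, mirrors the split the checklist describes between the main process and the connector's document class.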

examples/ingest/s3-small-batch/ingest.sh

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+#!/usr/bin/env bash
+
+# Processes 3 PDFs from s3://utic-dev-tech-fixtures/small-pdf-set/
+# through Unstructured's library in 2 processes.
+
+# Structured outputs are stored in s3-small-batch-output/
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+cd "$SCRIPT_DIR"/../../.. || exit 1
+
+PYTHONPATH=. ./unstructured/ingest/main.py \
+    --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
+    --s3-anonymous \
+    --structured-output-dir s3-small-batch-output \
+    --num-processes 2

examples/ingest/s3-small-batch/main.py

Lines changed: 0 additions & 109 deletions
This file was deleted.

setup.py

Lines changed: 3 additions & 1 deletion
@@ -46,7 +46,9 @@
     license="Apache-2.0",
     packages=find_packages(),
     version=__version__,
-    entry_points={},
+    entry_points={
+        'console_scripts': ['unstructured-ingest=unstructured.ingest.main:main'],
+    },
     install_requires=[
         "argilla",
         "lxml",
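
The `console_scripts` entry above maps a command name to a `module:callable` string. As a rough sketch of what that string means (a simplification, not the wrapper that pip actually generates), the hypothetical helper `load_entry_point` below resolves such a string to the callable it names. It is demonstrated on a standard-library target so it runs without `unstructured` installed.

```python
import importlib


def load_entry_point(spec: str):
    """Resolve a 'package.module:callable' string to the callable it names."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in target; for this commit the spec would be "unstructured.ingest.main:main".
basename = load_entry_point("os.path:basename")
print(basename("unstructured/ingest/main.py"))  # main.py
```

After installation, running `unstructured-ingest` amounts to importing `unstructured.ingest.main` and calling its `main` function.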

test_unstructured_ingest/test-ingest-s3.sh

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+cd "$SCRIPT_DIR"/.. || exit 1
+
+if [[ "$(find test_unstructured_ingest/expected-structured-output/s3-small-batch/ -type f -size +20k | wc -l)" != 3 ]]; then
+    echo "The test fixtures in test_unstructured_ingest/expected-structured-output/ look suspicious. At least one of the files is too small."
+    echo "Did you overwrite test fixtures with bad outputs?"
+    exit 1
+fi
+
+PYTHONPATH=. ./unstructured/ingest/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --s3-anonymous --structured-output-dir s3-small-batch-output
+
+if ! diff -ru s3-small-batch-output test_unstructured_ingest/expected-structured-output/s3-small-batch ; then
+    echo
+    echo "There are differences from the previously checked-in structured outputs."
+    echo
+    echo "If these differences are acceptable, copy the outputs from"
+    echo "s3-small-batch-output/ to test_unstructured_ingest/expected-structured-output/s3-small-batch/ after running"
+    echo
+    echo "  PYTHONPATH=. ./unstructured/ingest/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --s3-anonymous --structured-output-dir s3-small-batch-output"
+    echo
+    exit 1
+fi
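
The test script above compares freshly generated outputs against checked-in fixtures with `diff -ru`. The same idea can be sketched in Python for flat directories of files; `diff_outputs` is a hypothetical helper, not part of the repository, and unlike `diff -ru` it does not recurse into subdirectories or print the differing lines.

```python
import filecmp
from pathlib import Path
from typing import List


def diff_outputs(actual_dir: str, expected_dir: str) -> List[str]:
    """Return names of files that differ or that exist on only one side."""
    actual = {p.name for p in Path(actual_dir).iterdir() if p.is_file()}
    expected = {p.name for p in Path(expected_dir).iterdir() if p.is_file()}
    # Files present in exactly one of the two directories.
    problems = set(actual ^ expected)
    # Files present in both: compare contents byte for byte.
    for name in actual & expected:
        same = filecmp.cmp(
            Path(actual_dir) / name, Path(expected_dir) / name, shallow=False
        )
        if not same:
            problems.add(name)
    return sorted(problems)
```

An empty list corresponds to the shell script's success path; any returned names would trigger the "differences from the previously checked-in structured outputs" message.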

test_unstructured_ingest/test-ingest.sh

Lines changed: 3 additions & 18 deletions
@@ -2,22 +2,7 @@
 
 set -eux -o pipefail
 
-if [[ "$(find test_unstructured_ingest/expected-structured-output -type f -size +20k | wc -l)" != 3 ]]; then
-    echo "The test fixtures in test_unstructured_ingest/expected-structured-output/ look suspicious. At least one of the files is too small."
-    echo "Did you overwrite test fixtures with bad outputs?"
-    exit 1
-fi
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+cd "$SCRIPT_DIR"/.. || exit 1
 
-PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --anonymous --structured-output-dir s3-small-batch-output
-
-if ! diff -ru s3-small-batch-output test_unstructured_ingest/expected-structured-output/s3-small-batch ; then
-    echo
-    echo "There are differences from the previously checked-in structured outputs."
-    echo
-    echo "If these differences are acceptable, copy the outputs from"
-    echo "s3-small-batch-output/ to test_unstructured_ingest/expected-structured-output/s3-small-batch/ after running"
-    echo
-    echo "  PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --structured-output-dir s3-small-batch-output"
-    echo
-    exit 1
-fi
+./test_unstructured_ingest/test-ingest-s3.sh

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.4.11" # pragma: no cover
+__version__ = "0.4.12" # pragma: no cover

unstructured/ingest/connector/local_connector.py

Lines changed: 0 additions & 2 deletions
This file was deleted.
