|
1 | 1 | # Batch Processing Documents |
2 | 2 |
|
3 | | -## Sample Connector: S3 |
4 | | - |
5 | | -See the sample project [examples/ingest/s3-small-batch/main.py](examples/ingest/s3-small-batch/main.py), which processes all the documents under a given s3 URL with 2 parallel processes, writing the structured json output to `structured-outputs/`. |
6 | | - |
7 | | -You can try it out with: |
8 | | - |
9 | | - PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ --anonymous |
10 | | - |
11 | | - # Note: the --anonymous flag indicates not to provide AWS credentials, needed |
12 | | - # for the boto3 lib. Remove this flag when local AWS credentials are required. |
13 | | - |
14 | | -This utility is ready to use with any s3 prefix! |
15 | | - |
16 | | -By default, it will not reprocess files from s3 if their outputs already exist in --structured-output-dir. Naturally, this may come in handy when processing a large number of files. However, you can force reprocessing of all documents with the --reprocess flag. |
17 | | - |
18 | | - |
19 | | -``` |
20 | | -$ PYTHONPATH=. python examples/ingest/s3-small-batch/main.py --help |
21 | | -Usage: main.py [OPTIONS] |
22 | | -
|
23 | | -Options: |
24 | | - --s3-url TEXT Prefix of s3 objects (files) to download. |
25 | | - E.g. s3://bucket1/path/. This value may also |
26 | | - be a single file. |
27 | | - --re-download / --no-re-download |
28 | | - Re-download files from s3 even if they are |
29 | | - already present in --download-dir. |
30 | | - --download-dir TEXT Where s3 files are downloaded to, defaults |
31 | | - to tmp-ingest-<6 random chars>. |
32 | | - --preserve-downloads Preserve downloaded s3 files. Otherwise each |
33 | | - file is removed after being processed |
34 | | - successfully. |
35 | | - --structured-output-dir TEXT Where to place structured output .json |
36 | | - files. |
37 | | - --reprocess Reprocess a downloaded file from s3 even if |
38 | | - the relevant structured output .json file in |
39 | | - --structured-output-dir already exists. |
40 | | - --num-processes INTEGER Number of parallel processes to process docs |
41 | | - in. [default: 2] |
42 | | - --anonymous Connect to s3 without local AWS credentials. |
43 | | - -v, --verbose |
44 | | - --help Show this message and exit. |
45 | | -``` |
46 | | - |
47 | | -# Developer notes |
48 | | - |
49 | | -## The Abstractions |
50 | | - |
51 | | -```mermaid |
52 | | -sequenceDiagram |
53 | | - participant MainProcess |
54 | | - participant DocReader (connector) |
55 | | - participant DocProcessor |
56 | | -    participant StructuredDocWriter (connector) |
57 | | - MainProcess->>DocReader (connector): Initialize / Authorize |
58 | | - DocReader (connector)->>MainProcess: All doc metadata (no file content) |
59 | | - loop Single doc at a time (allows for multiprocessing) |
60 | | - MainProcess->>DocProcessor: Raw document metadata (no file content) |
61 | | - DocProcessor->>DocReader (connector): Request document |
62 | | - DocReader (connector)->>DocProcessor: Single document payload |
63 | | - Note over DocProcessor: Process through Unstructured |
64 | | -        DocProcessor->>StructuredDocWriter (connector): Write Structured Data |
65 | | -        Note over StructuredDocWriter (connector): <br /> Optionally store version info, filename, etc |
66 | | - DocProcessor->>MainProcess: Structured Data (only JSON in V0) |
67 | | - end |
68 | | - Note over MainProcess: Optional - process structured data from all docs |
69 | | -``` |
70 | | - |
71 | | -The abstractions in the above diagram are honored in the S3 Connector project (though ABCs are not yet written), with the exception of the StructuredDocWriter, which may be added more formally at a later time. |
| 3 | +## The unstructured-ingest CLI |
| 4 | + |
| 5 | +The unstructured library includes a CLI to batch ingest documents from (soon to be |
| 6 | +various) sources, storing structured outputs locally on the filesystem. |
| 7 | + |
| 8 | +For example, the following command processes all the documents in the |
| 9 | +`utic-dev-tech-fixtures` S3 bucket under the prefix `small-pdf-set/`. |
| 10 | + |
| 11 | + unstructured-ingest \ |
| 12 | + --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ \ |
| 13 | + --s3-anonymous \ |
| 14 | + --structured-output-dir s3-small-batch-output \ |
| 15 | + --num-processes 2 |
| 16 | + |
| 17 | +Naturally, --num-processes may be adjusted for better utilization of the machine, since documents are processed in parallel with multiprocessing. |
| 18 | + |
| 19 | +Installation note: the above command requires installing unstructured with the following extras: |
| 20 | + |
| 21 | + pip install "unstructured[s3,local-inference]" |
| 22 | + |
| 23 | +# Developers' Guide |
| 24 | + |
| 25 | +## Local testing |
| 26 | + |
| 27 | +When testing from a local checkout rather than a pip-installed version of `unstructured`, |
| 28 | +just execute `unstructured/ingest/main.py`, e.g.: |
| 29 | + |
| 30 | + PYTHONPATH=. ./unstructured/ingest/main.py \ |
| 31 | + --s3-url s3://utic-dev-tech-fixtures/small-pdf-set/ \ |
| 32 | + --s3-anonymous \ |
| 33 | + --structured-output-dir s3-small-batch-output \ |
| 34 | + --num-processes 2 |
| 35 | + |
| 36 | +## Adding Data Connectors |
| 37 | + |
| 38 | +To add a connector, refer to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py) as an example that implements the three relevant abstract base classes. |
| 39 | + |
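| | +As a rough sketch only (the `MySource*` names, fields, and method bodies below are placeholders rather than the library's actual API; only `BaseConnectorConfig` and the `has_output()` / `get_file()` / `cleanup_file()` conventions come from this guide), a new connector module provides three pieces: a config, a connector that lists documents, and an ingest doc that can fetch and clean up a single document. |
| | + |
| | +```python |
| | +# Placeholder sketch of a new connector module; a real implementation would |
| | +# subclass the abstract base classes in unstructured/ingest/interfaces.py. |
| | +import os |
| | +from dataclasses import dataclass |
| | +from typing import List |
| | + |
| | + |
| | +@dataclass |
| | +class SimpleMySourceConfig:  # in practice, extends BaseConnectorConfig |
| | +    download_dir: str |
| | +    output_dir: str |
| | +    re_download: bool = False |
| | +    preserve_download: bool = False |
| | +    reprocess: bool = False |
| | +    verbose: bool = False |
| | + |
| | + |
| | +class MySourceIngestDoc: |
| | +    """One remote document: knows how to fetch it and how to clean it up.""" |
| | + |
| | +    def __init__(self, config: SimpleMySourceConfig, remote_name: str): |
| | +        self.config = config |
| | +        self.remote_name = remote_name |
| | + |
| | +    @property |
| | +    def filename(self) -> str: |
| | +        return os.path.join(self.config.download_dir, self.remote_name) |
| | + |
| | +    def has_output(self) -> bool: |
| | +        # Lets already-processed docs be skipped unless --reprocess is set. |
| | +        output = os.path.join(self.config.output_dir, self.remote_name + ".json") |
| | +        return os.path.isfile(output) |
| | + |
| | +    def get_file(self): |
| | +        # Skip the download if the file is present, unless --re-download is set. |
| | +        if os.path.isfile(self.filename) and not self.config.re_download: |
| | +            return |
| | +        # ... fetch the remote document into self.filename ... |
| | + |
| | +    def cleanup_file(self): |
| | +        # Remove the raw download after successful processing, unless |
| | +        # --preserve-downloads is set. |
| | +        if not self.config.preserve_download and os.path.isfile(self.filename): |
| | +            os.unlink(self.filename) |
| | + |
| | + |
| | +class MySourceConnector: |
| | +    """Lists document metadata so the main process can fan out the work.""" |
| | + |
| | +    def __init__(self, config: SimpleMySourceConfig): |
| | +        self.config = config |
| | + |
| | +    def initialize(self): |
| | +        os.makedirs(self.config.download_dir, exist_ok=True) |
| | +        os.makedirs(self.config.output_dir, exist_ok=True) |
| | + |
| | +    def get_ingest_docs(self) -> List[MySourceIngestDoc]: |
| | +        # ... enumerate remote documents, one MySourceIngestDoc per item ... |
| | +        return [] |
| | +``` |
| | + |
| | +The key point is that the doc objects carry only metadata until `get_file()` is called, which is what lets the main process fan the per-document work out across --num-processes workers. |
| | + |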
| 40 | +Then, update [unstructured/ingest/main.py](unstructured/ingest/main.py) to instantiate |
| 41 | +your connector when its command-line options are invoked. |
| 42 | + |
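| | +Purely as an illustration (this is not the actual main.py code, and it reuses the placeholder names from the sketch above), the wiring amounts to building the new connector's config from its command-line options and returning the connector only when those options were supplied: |
| | + |
| | +```python |
| | +# Hypothetical wiring sketch, not the CLI's real implementation: construct the |
| | +# new connector only when its own command-line options were provided. |
| | +def build_connector(options: dict): |
| | +    if options.get("mysource_url"): |
| | +        config = SimpleMySourceConfig( |
| | +            download_dir=options["download_dir"], |
| | +            output_dir=options["structured_output_dir"], |
| | +            re_download=options["re_download"], |
| | +            preserve_download=options["preserve_download"], |
| | +            reprocess=options["reprocess"], |
| | +            verbose=options["verbose"], |
| | +        ) |
| | +        return MySourceConnector(config) |
| | +    # ... otherwise fall back to the existing connectors (e.g. S3) ... |
| | +    return None |
| | +``` |
| | + |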
| 43 | +Create at least one folder under [examples/ingest](examples/ingest) with an easily reproducible |
| 44 | +script that shows the new connector in action. |
| 45 | + |
| 46 | +Finally, to ensure the connector remains stable, add a new script test_unstructured_ingest/test-ingest-\<the-new-data-source\>.sh similar to [test_unstructured_ingest/test-ingest-s3.sh](test_unstructured_ingest/test-ingest-s3.sh), and append a line invoking the new script in [test_unstructured_ingest/test-ingest.sh](test_unstructured_ingest/test-ingest.sh). |
| 47 | + |
| 48 | +You'll notice that the unstructured outputs for the new documents are expected |
| 49 | +to be checked in under test_unstructured_ingest/expected-structured-output/\<folder-name-relevant-to-your-dataset\>. So, you'll need to `git add` those JSON outputs so that `test-ingest.sh` passes in CI. |
| 50 | + |
| 51 | +The `main.py` flags --re-download/--no-re-download, --download-dir, --preserve-downloads, --structured-output-dir, and --reprocess are honored by the connector. |
| 52 | + |
| 53 | +### The checklist: |
| 54 | + |
| 55 | +In checklist form, the above steps are summarized as: |
| 56 | + |
| 57 | +- [ ] Create a new module under [unstructured/ingest/connector/](unstructured/ingest/connector/) implementing the 3 abstract base classes, similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py). |
| 58 | +- [ ] Update [unstructured/ingest/main.py](unstructured/ingest/main.py) with support for the new connector. |
| 59 | +- [ ] Create a folder under [examples/ingest](examples/ingest) that includes at least one well documented script. |
| 60 | +- [ ] Add a script test_unstructured_ingest/test-ingest-\<the-new-data-source\>.sh. Its JSON output files should total no more than 100K. |
| 61 | +- [ ] Git add the expected outputs under test_unstructured_ingest/expected-structured-output/\<folder-name-relevant-to-your-dataset\> so the above test passes in CI. |
| 62 | +- [ ] Add a line to [test_unstructured_ingest/test-ingest.sh](test_unstructured_ingest/test-ingest.sh) invoking the new test script. |
| 63 | +- [ ] Honor the conventions of `BaseConnectorConfig` defined in [unstructured/ingest/interfaces.py](unstructured/ingest/interfaces.py), which is passed through [the CLI](unstructured/ingest/main.py): |
| 64 | +  - [ ] If running with an `.output_dir` where structured outputs already exist for a given file, the file content is not re-downloaded from the data source nor is it reprocessed. This is made possible by implementing `MyIngestDoc.has_output()`, which is invoked in [MainProcess._filter_docs_with_outputs](unstructured/ingest/main.py). |
| 65 | +  - [ ] Unless `.reprocess` is `True`, in which case documents are always reprocessed. |
| 66 | + - [ ] If `.preserve_download` is `True`, documents downloaded to `.download_dir` are not removed after processing. |
| 67 | +  - [ ] Otherwise (`.preserve_download` is `False`), documents downloaded to `.download_dir` are removed after they are **successfully** processed, during the invocation of `MyIngestDoc.cleanup_file()` in [process_document](unstructured/ingest/doc_processor/generalized.py). |
| 68 | +  - [ ] Does not re-download documents to `.download_dir` if `.re_download` is `False`, enforced in `MyIngestDoc.get_file()`. |
| 69 | +  - [ ] Prints more details if `.verbose` is `True`, similar to [unstructured/ingest/connector/s3_connector.py](unstructured/ingest/connector/s3_connector.py). |
| 70 | + |