Commit fb5b3b6

Delta Table v2 API destination connector (#301)
1 parent f030caf commit fb5b3b6

9 files changed

+208
-37
lines changed

api-reference/ingest/destination-connector/delta-table.mdx

Lines changed: 20 additions & 2 deletions
@@ -2,6 +2,24 @@
 title: Delta Table
 ---
 
-import SharedDeltaTable from '/snippets/dc-shared-text/delta-table.mdx';
+import NewDocument from '/snippets/general-shared-text/new-document.mdx';
 
-<SharedDeltaTable />
+<NewDocument />
+
+import SharedContentDeltaTable from '/snippets/dc-shared-text/delta-table-cli-api.mdx';
+import SharedAPIKeyURL from '/snippets/general-shared-text/api-key-url.mdx';
+
+<SharedContentDeltaTable/>
+<SharedAPIKeyURL/>
+
+Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector:
+
+import DeltaTableAPISh from '/snippets/destination_connectors/delta_table.sh.mdx';
+import DeltaTableAPIPyV2 from '/snippets/destination_connectors/delta_table.v2.py.mdx';
+import DeltaTableAPIPyV1 from '/snippets/destination_connectors/delta_table.v1.py.mdx';
+
+<CodeGroup>
+<DeltaTableAPISh />
+<DeltaTableAPIPyV2 />
+<DeltaTableAPIPyV1 />
+</CodeGroup>

open-source/ingest/destination-connectors/delta-table.mdx

Lines changed: 23 additions & 1 deletion
@@ -2,6 +2,28 @@
 title: Delta Table
 ---
 
-import SharedDeltaTable from '/snippets/dc-shared-text/delta-table.mdx';
+import NewDocument from '/snippets/general-shared-text/new-document.mdx';
+
+<NewDocument />
+
+import SharedDeltaTable from '/snippets/dc-shared-text/delta-table-cli-api.mdx';
 
 <SharedDeltaTable />
+
+Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector.
+
+This example sends files to Unstructured API services for processing by default. To process files locally instead, see the instructions at the end of this page.
+
+import DeltaTableAPISh from '/snippets/destination_connectors/delta_table.sh.mdx';
+import DeltaTableAPIPyV2 from '/snippets/destination_connectors/delta_table.v2.py.mdx';
+import DeltaTableAPIPyV1 from '/snippets/destination_connectors/delta_table.v1.py.mdx';
+
+<CodeGroup>
+<DeltaTableAPISh />
+<DeltaTableAPIPyV2 />
+<DeltaTableAPIPyV1 />
+</CodeGroup>
+
+import SharedPartitionByAPIOSS from '/snippets/ingest-configuration-shared/partition-by-api-oss.mdx';
+
+<SharedPartitionByAPIOSS/>
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+Batch process all your records to store structured outputs in a Delta Table in an Amazon S3 bucket.
+
+You will need:
+
+import SharedDeltaTable from '/snippets/general-shared-text/delta-table.mdx';
+import SharedDeltaTableCLIAPI from '/snippets/general-shared-text/delta-table-cli-api.mdx';
+
+<SharedDeltaTable />
+<SharedDeltaTableCLIAPI />

snippets/dc-shared-text/delta-table.mdx

Lines changed: 0 additions & 25 deletions
This file was deleted.
Lines changed: 9 additions & 6 deletions
@@ -1,17 +1,20 @@
-```bash Shell
+```bash CLI
 #!/usr/bin/env bash
 
 # Chunking and embedding are optional.
 
 unstructured-ingest \
   local \
     --input-path $LOCAL_FILE_INPUT_DIR \
-    --output-dir $LOCAL_FILE_OUTPUT_DIR \
+    --partition-by-api \
+    --api-key $UNSTRUCTURED_API_KEY \
+    --partition-endpoint $UNSTRUCTURED_API_URL \
     --strategy hi_res \
-    --chunk-elements \
+    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
+    --chunking-strategy by_title \
     --embedding-provider huggingface \
-    --num-processes 2 \
-    --verbose \
   delta-table \
-    --table-uri delta-table-dest
+    --aws-access-key-id $AWS_ACCESS_KEY_ID \
+    --aws-secret-access-key $AWS_SECRET_ACCESS_KEY \
+    --table-uri $AWS_S3_URL
 ```

snippets/destination_connectors/delta_table.py.mdx renamed to snippets/destination_connectors/delta_table.v1.py.mdx

Lines changed: 2 additions & 3 deletions
@@ -1,4 +1,4 @@
-```python Python
+```python Python Ingest v1
 import os
 
 from unstructured_ingest.connector.delta_table import DeltaTableWriteConfig, SimpleDeltaTableConfig
@@ -20,9 +20,8 @@ from unstructured_ingest.runner.writers.delta_table import (
 def get_writer() -> Writer:
     return DeltaTableWriter(
         connector_config=SimpleDeltaTableConfig(
-            table_uri="delta-table-dest",
+            table_uri=os.getenv("AWS_S3_URL"),
             storage_options={
-                "AWS_REGION": "us-east-2",
                 "AWS_ACCESS_KEY_ID": os.getenv("AWS_ACCESS_KEY_ID"),
                 "AWS_SECRET_ACCESS_KEY": os.getenv("AWS_SECRET_ACCESS_KEY"),
             },
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+```python Python Ingest v2
+import os
+
+from unstructured_ingest.v2.pipeline.pipeline import Pipeline
+from unstructured_ingest.v2.interfaces import ProcessorConfig
+
+from unstructured_ingest.v2.processes.connectors.delta_table import (
+    DeltaTableConnectionConfig,
+    DeltaTableAccessConfig,
+    DeltaTableUploadStagerConfig,
+    DeltaTableUploaderConfig
+)
+
+from unstructured_ingest.v2.processes.connectors.local import (
+    LocalIndexerConfig,
+    LocalConnectionConfig,
+    LocalDownloaderConfig
+)
+
+from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
+from unstructured_ingest.v2.processes.chunker import ChunkerConfig
+from unstructured_ingest.v2.processes.embedder import EmbedderConfig
+
+# Chunking and embedding are optional.
+
+if __name__ == "__main__":
+
+    Pipeline.from_configs(
+        context=ProcessorConfig(),
+        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
+        downloader_config=LocalDownloaderConfig(),
+        source_connection_config=LocalConnectionConfig(),
+        partitioner_config=PartitionerConfig(
+            partition_by_api=True,
+            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
+            additional_partition_args={
+                "split_pdf_page": True,
+                "split_pdf_allow_failed": True,
+                "split_pdf_concurrency_level": 15
+            }
+        ),
+        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
+        embedder_config=EmbedderConfig(embedding_provider="huggingface"),
+        destination_connection_config=DeltaTableConnectionConfig(
+            access_config=DeltaTableAccessConfig(
+                aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
+                aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY")
+            ),
+            table_uri=os.getenv("AWS_S3_URL")
+        ),
+        stager_config=DeltaTableUploadStagerConfig(),
+        uploader_config=DeltaTableUploaderConfig()
+    ).run()
+```
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+The Delta Table connector dependencies for Amazon S3:
+
+```bash CLI, Python
+pip install "unstructured-ingest[delta-table]"
+```
+
+import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-dependencies.mdx';
+
+<AdditionalIngestDependencies />
+
+The following environment variables:
+
+- `AWS_S3_URL` - The path to the S3 bucket or folder, formatted as `s3://my-bucket/` (if the files are in the bucket's root) or `s3://my-bucket/my-folder/`, represented by `--table-uri` (CLI) or `table_uri` (Python).
+- `AWS_ACCESS_KEY_ID` - The AWS access key ID for the authenticated AWS IAM user, represented by `--aws-access-key-id` (CLI) or `aws_access_key_id` (Python).
+- `AWS_SECRET_ACCESS_KEY` - The corresponding AWS secret access key, represented by `--aws-secret-access-key` (CLI) or `aws_secret_access_key` (Python).
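
The environment variables listed above can be checked before a pipeline run. The following is a minimal sketch only: the helper name `check_delta_table_env` is hypothetical and is not part of `unstructured-ingest`, and the URL check assumes the `s3://my-bucket/` or `s3://my-bucket/my-folder/` format described above.

```python
import os
from urllib.parse import urlparse

# Variable names taken from the list above.
REQUIRED_VARS = ("AWS_S3_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def check_delta_table_env(env=None):
    """Return a list of problems found; an empty list means the settings look usable.

    Hypothetical helper for illustration; not part of unstructured-ingest.
    """
    env = os.environ if env is None else env
    # Report every required variable that is unset or empty.
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    url = env.get("AWS_S3_URL")
    if url:
        parsed = urlparse(url)
        # Expect the s3:// scheme followed by a bucket name.
        if parsed.scheme != "s3" or not parsed.netloc:
            problems.append(
                f"AWS_S3_URL should look like s3://my-bucket/ or "
                f"s3://my-bucket/my-folder/, got {url!r}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_delta_table_env():
        print(problem)
```

Running the script prints one line per problem and prints nothing when all three variables are set and the URL uses the `s3://` scheme. It does not contact AWS, so it cannot confirm that the credentials are valid.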
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+The Delta Table prerequisites for Amazon S3:
+
+The following video shows how to fulfill the minimum set of S3 prerequisites:
+
+<iframe
+  width="560"
+  height="315"
+  src="https://www.youtube.com/embed/_W4565dcUGI"
+  title="YouTube video player"
+  frameborder="0"
+  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+  allowfullscreen
+></iframe>
+
+The preceding video does not show how to create an AWS account or an S3 bucket.
+
+For more information about prerequisites, see the following:
+
+- An AWS account. [Create an AWS account](https://aws.amazon.com/free).
+
+  <iframe
+    width="560"
+    height="315"
+    src="https://www.youtube.com/embed/lIdh92JmWtg"
+    title="YouTube video player"
+    frameborder="0"
+    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+    allowfullscreen
+  ></iframe>
+
+- An S3 bucket. [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
+  Additional approaches are in the following video and in the how-to sections at the end of this page.
+
+  <iframe
+    width="560"
+    height="315"
+    src="https://www.youtube.com/embed/e6w9LwZJFIA"
+    title="YouTube video player"
+    frameborder="0"
+    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+    allowfullscreen
+  ></iframe>
+
+- For authenticated bucket read access, the authenticated AWS IAM user must have at minimum the permissions of `s3:ListBucket` and `s3:GetObject` for that bucket. [Learn how](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-policies-s3.html).
+
+  <iframe
+    width="560"
+    height="315"
+    src="https://www.youtube.com/embed/y4SfQoJpipo"
+    title="YouTube video player"
+    frameborder="0"
+    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+    allowfullscreen
+  ></iframe>
+
+- For bucket write access, authenticated access to the bucket must be enabled (anonymous access must not be enabled), and the authenticated AWS IAM user must have at
+  minimum the permission of `s3:PutObject` for that bucket. [Learn how](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-policies-s3.html).
+
+- For authenticated access, an AWS access key and secret access key for the authenticated AWS IAM user in the account.
+  [Create an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey).
+
+  <iframe
+    width="560"
+    height="315"
+    src="https://www.youtube.com/embed/MoFTaGJE65Q"
+    title="YouTube video player"
+    frameborder="0"
+    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+    allowfullscreen
+  ></iframe>
+
+- If the target files are in the root of the bucket, the path to the bucket, formatted as `protocol://bucket/` (for example, `s3://my-bucket/`).
+  If the target files are in a folder, the path to the target folder in the S3 bucket, formatted as `protocol://bucket/path/to/folder/` (for example, `s3://my-bucket/my-folder/`).
+- If the target files are in a folder, make sure the authenticated AWS IAM user has
+  authenticated access to the folder as well. [Enable authenticated folder access](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies.html#example-bucket-policies-folders).
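
The read and write permissions named in the prerequisites above can be expressed as an IAM policy document. The following is a sketch under stated assumptions: `build_s3_policy` is a hypothetical helper, the bucket and folder names are placeholders, and the policy covers only the three actions the list names (`s3:ListBucket`, `s3:GetObject`, `s3:PutObject`). Your setup may require more.

```python
import json


def build_s3_policy(bucket: str, folder: str = "") -> dict:
    """Build an IAM policy document with the minimum permissions listed above.

    Hypothetical helper for illustration; bucket and folder are placeholders.
    """
    prefix = folder.strip("/")
    # Object-level actions apply to keys; scope them to the folder if one is given.
    object_path = f"{prefix}/*" if prefix else "*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # s3:GetObject and s3:PutObject are object-level actions.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{object_path}"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_policy("my-bucket", "my-folder/"), indent=2))
```

The printed JSON can be pasted into the IAM console's policy editor and attached to the IAM user whose access key the connector uses.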
