Commit f22c292

API: LanceDB v2 destination connector (#345)
1 parent 1295f29 commit f22c292

8 files changed

+339
-0
lines changed

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
---
title: LanceDB
---

import NewDocument from '/snippets/general-shared-text/new-document.mdx';

<NewDocument />

import SharedContentLanceDB from '/snippets/dc-shared-text/lancedb-cli-api.mdx';
import SharedAPIKeyURL from '/snippets/general-shared-text/api-key-url.mdx';

<SharedContentLanceDB/>
<SharedAPIKeyURL/>

Now call the Unstructured CLI or Python SDK. You can use any supported source connector; this example uses the local source connector:

import LanceDBAPISh from '/snippets/destination_connectors/lancedb.sh.mdx';
import LanceDBAPIPyV2 from '/snippets/destination_connectors/lancedb.v2.py.mdx';

<CodeGroup>
  <LanceDBAPISh />
  <LanceDBAPIPyV2 />
</CodeGroup>

mint.json

Lines changed: 2 additions & 0 deletions
@@ -199,6 +199,7 @@
 "open-source/ingest/destination-connectors/google-cloud-service",
 "open-source/ingest/destination-connectors/kafka",
 "open-source/ingest/destination-connectors/kdbai",
+"open-source/ingest/destination-connectors/lancedb",
 "open-source/ingest/destination-connectors/local",
 "open-source/ingest/destination-connectors/milvus",
 "open-source/ingest/destination-connectors/mongodb",
@@ -357,6 +358,7 @@
 "api-reference/ingest/destination-connector/google-cloud-service",
 "api-reference/ingest/destination-connector/kafka",
 "api-reference/ingest/destination-connector/kdbai",
+"api-reference/ingest/destination-connector/lancedb",
 "api-reference/ingest/destination-connector/local",
 "api-reference/ingest/destination-connector/milvus",
 "api-reference/ingest/destination-connector/mongodb",
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
---
title: LanceDB
---

import NewDocument from '/snippets/general-shared-text/new-document.mdx';

<NewDocument />

import SharedLanceDB from '/snippets/dc-shared-text/lancedb-cli-api.mdx';

<SharedLanceDB />

Now call the Unstructured CLI or Python library. You can use any supported source connector; this example uses the local source connector.

This example sends files to Unstructured API services for processing by default. To process files locally instead, see the instructions at the end of this page.

import LanceDBAPISh from '/snippets/destination_connectors/lancedb.sh.mdx';
import LanceDBAPIPyV2 from '/snippets/destination_connectors/lancedb.v2.py.mdx';

<CodeGroup>
  <LanceDBAPISh />
  <LanceDBAPIPyV2 />
</CodeGroup>

import SharedPartitionByAPIOSS from '/snippets/ingest-configuration-shared/partition-by-api-oss.mdx';

<SharedPartitionByAPIOSS/>
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
Batch process all your records to store structured outputs in LanceDB.

You will need:

import SharedLanceDB from '/snippets/general-shared-text/lancedb.mdx';
import SharedLanceDBCLIAPI from '/snippets/general-shared-text/lancedb-cli-api.mdx';

<SharedLanceDB />
<SharedLanceDBCLIAPI />
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
```bash CLI
#!/usr/bin/env bash

# Chunking and embedding are optional.

# For LanceDB OSS with local data storage:
unstructured-ingest \
  local \
    --input-path "$LOCAL_FILE_INPUT_DIR" \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key "$UNSTRUCTURED_API_KEY" \
    --partition-endpoint "$UNSTRUCTURED_API_URL" \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-local \
    --uri "$LANCEDB_URI" \
    --table-name "$LANCEDB_TABLE"

# For LanceDB OSS with data storage in an Amazon S3 bucket:
unstructured-ingest \
  local \
    --input-path "$LOCAL_FILE_INPUT_DIR" \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key "$UNSTRUCTURED_API_KEY" \
    --partition-endpoint "$UNSTRUCTURED_API_URL" \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-aws \
    --aws-access-key-id "$AWS_ACCESS_KEY_ID" \
    --aws-secret-access-key "$AWS_SECRET_ACCESS_KEY" \
    --uri "$LANCEDB_URI" \
    --table-name "$LANCEDB_TABLE" \
    --timeout 30s

# For LanceDB OSS with data storage in an Azure Blob Storage account:
unstructured-ingest \
  local \
    --input-path "$LOCAL_FILE_INPUT_DIR" \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key "$UNSTRUCTURED_API_KEY" \
    --partition-endpoint "$UNSTRUCTURED_API_URL" \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-azure \
    --azure-storage-account-name "$AZURE_STORAGE_ACCOUNT_NAME" \
    --azure-storage-account-key "$AZURE_STORAGE_ACCOUNT_KEY" \
    --uri "$LANCEDB_URI" \
    --table-name "$LANCEDB_TABLE" \
    --timeout 30s

# For LanceDB OSS with data storage in a Google Cloud Storage bucket:
# (The key is a JSON string, so the variable must stay quoted.)
unstructured-ingest \
  local \
    --input-path "$LOCAL_FILE_INPUT_DIR" \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key "$UNSTRUCTURED_API_KEY" \
    --partition-endpoint "$UNSTRUCTURED_API_URL" \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-gcs \
    --google-service-account-key "$GCS_SERVICE_ACCOUNT_KEY" \
    --uri "$LANCEDB_URI" \
    --table-name "$LANCEDB_TABLE" \
    --timeout 30s
```
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
```python Python Ingest v2
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig

from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

# For LanceDB OSS with local data storage:
# from unstructured_ingest.v2.processes.connectors.lancedb.local import (
#     LanceDBLocalConnectionConfig,
#     LanceDBLocalAccessConfig,
#     LanceDBUploadStagerConfig,
#     LanceDBUploaderConfig
# )

# For LanceDB OSS with data storage in an Amazon S3 bucket:
from unstructured_ingest.v2.processes.connectors.lancedb.aws import (
    LanceDBS3ConnectionConfig,
    LanceDBS3AccessConfig,
    LanceDBUploadStagerConfig,
    LanceDBUploaderConfig
)

# For LanceDB OSS with data storage in an Azure Blob Storage account:
# from unstructured_ingest.v2.processes.connectors.lancedb.azure import (
#     LanceDBAzureConnectionConfig,
#     LanceDBAzureAccessConfig,
#     LanceDBUploadStagerConfig,
#     LanceDBUploaderConfig
# )

# For LanceDB OSS with data storage in a Google Cloud Storage bucket:
# from unstructured_ingest.v2.processes.connectors.lancedb.gcp import (
#     LanceDBGCSConnectionConfig,
#     LanceDBGCSAccessConfig,
#     LanceDBUploadStagerConfig,
#     LanceDBUploaderConfig
# )

# Chunking and embedding are optional.

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        embedder_config=EmbedderConfig(embedding_provider="huggingface"),

        # For LanceDB OSS with local data storage:
        # destination_connection_config=LanceDBLocalConnectionConfig(
        #     access_config=LanceDBLocalAccessConfig(),
        #     uri=os.getenv("LANCEDB_URI")
        # ),

        # For LanceDB OSS with data storage in an Amazon S3 bucket:
        destination_connection_config=LanceDBS3ConnectionConfig(
            access_config=LanceDBS3AccessConfig(
                aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
                aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY")
            ),
            uri=os.getenv("LANCEDB_URI"),
            timeout="30s"
        ),

        # For LanceDB OSS with data storage in an Azure Blob Storage account:
        # destination_connection_config=LanceDBAzureConnectionConfig(
        #     access_config=LanceDBAzureAccessConfig(
        #         azure_storage_account_name=os.getenv("AZURE_STORAGE_ACCOUNT_NAME"),
        #         azure_storage_account_key=os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
        #     ),
        #     uri=os.getenv("LANCEDB_URI"),
        #     timeout="30s"
        # ),

        # For LanceDB OSS with data storage in a Google Cloud Storage bucket:
        # destination_connection_config=LanceDBGCSConnectionConfig(
        #     access_config=LanceDBGCSAccessConfig(
        #         google_service_account_key=os.getenv("GCS_SERVICE_ACCOUNT_KEY")
        #     ),
        #     uri=os.getenv("LANCEDB_URI"),
        #     timeout="30s"
        # ),

        stager_config=LanceDBUploadStagerConfig(),
        uploader_config=LanceDBUploaderConfig(table_name=os.getenv("LANCEDB_TABLE"))
    ).run()
```
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
The LanceDB connector dependencies:

```bash CLI, Python
pip install "unstructured-ingest[lancedb]"
```

import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-dependencies.mdx';

<AdditionalIngestDependencies />

The following environment variables:

- For LanceDB OSS with local data storage:

  - `LANCEDB_URI` - The local path to the folder where the LanceDB data is stored, represented by `--uri` (CLI) or `uri` (Python).
  - `LANCEDB_TABLE` - The name of the target LanceDB table within the local data folder, represented by `--table-name` (CLI) or `table_name` (Python).

- For LanceDB OSS with data storage in an Amazon S3 bucket:

  - `LANCEDB_URI` - The URI for the target Amazon S3 bucket and any target folder path within that bucket. Use the format `s3://<bucket-name>[/<folder-name>]`. This is represented by `--uri` (CLI) or `uri` (Python).
  - `LANCEDB_TABLE` - The name of the target LanceDB table within the Amazon S3 bucket, represented by `--table-name` (CLI) or `table_name` (Python).
  - `AWS_ACCESS_KEY_ID` - The AWS access key ID for the AWS IAM entity that has access to the Amazon S3 bucket, represented by `--aws-access-key-id` (CLI) or `aws_access_key_id` (Python).
  - `AWS_SECRET_ACCESS_KEY` - The AWS secret access key for the AWS IAM entity that has access to the Amazon S3 bucket, represented by `--aws-secret-access-key` (CLI) or `aws_secret_access_key` (Python).

- For LanceDB OSS with data storage in an Azure Blob Storage account:

  - `LANCEDB_URI` - The URI for the target container within that Azure Blob Storage account and any target folder path within that container. Use the format `az://<container-name>[/<folder-name>]`. This is represented by `--uri` (CLI) or `uri` (Python).
  - `LANCEDB_TABLE` - The name of the target LanceDB table within the Azure Blob Storage account, represented by `--table-name` (CLI) or `table_name` (Python).
  - `AZURE_STORAGE_ACCOUNT_NAME` - The name of the target Azure Blob Storage account, represented by `--azure-storage-account-name` (CLI) or `azure_storage_account_name` (Python).
  - `AZURE_STORAGE_ACCOUNT_KEY` - The access key for the Azure Blob Storage account, represented by `--azure-storage-account-key` (CLI) or `azure_storage_account_key` (Python).

- For LanceDB OSS with data storage in a Google Cloud Storage bucket:

  - `LANCEDB_URI` - The URI for the target Google Cloud Storage bucket and any target folder path within that bucket. Use the format `gs://<bucket-name>[/<folder-name>]`. This is represented by `--uri` (CLI) or `uri` (Python).
  - `LANCEDB_TABLE` - The name of the target LanceDB table within the Google Cloud Storage bucket, represented by `--table-name` (CLI) or `table_name` (Python).
  - `GCS_SERVICE_ACCOUNT_KEY` - A single-line string that contains the contents of the downloaded service account key file for the Google Cloud service account
    that has access to the Google Cloud Storage bucket, represented by `--google-service-account-key` (CLI) or `google_service_account_key` (Python).
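
As a rough illustration of the variables listed above, the following sketch infers the storage backend from the `LANCEDB_URI` scheme and reports which of the required variables are unset. The helper names `detect_backend` and `missing_env_vars` are hypothetical and are not part of `unstructured-ingest`. Treating any URI without an `s3`, `az`, or `gs` scheme as local storage is an assumption based on the URI formats described above.

```python
import os
from urllib.parse import urlparse

# Required variables per storage backend, copied from the list above.
_COMMON = ["LANCEDB_URI", "LANCEDB_TABLE"]
REQUIRED_ENV_VARS = {
    "local": _COMMON,
    "s3": _COMMON + ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "az": _COMMON + ["AZURE_STORAGE_ACCOUNT_NAME", "AZURE_STORAGE_ACCOUNT_KEY"],
    "gs": _COMMON + ["GCS_SERVICE_ACCOUNT_KEY"],
}


def detect_backend(uri):
    """Guess the backend from the URI scheme (assumption: no known scheme means local)."""
    scheme = urlparse(uri).scheme
    return scheme if scheme in ("s3", "az", "gs") else "local"


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    backend = detect_backend(env.get("LANCEDB_URI", ""))
    return [name for name in REQUIRED_ENV_VARS[backend] if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```

Running a check like this before invoking the CLI or the Python pipeline can surface an unset variable early, instead of as a connection error partway through ingestion.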
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
The LanceDB prerequisites:

- A [LanceDB open source software (OSS) installation](https://lancedb.github.io/lancedb/basic/#installation) on a local machine, a server, or a virtual machine.
  (LanceDB Cloud is not supported.)
- For LanceDB OSS with local data storage:

  - The local path to the folder where the LanceDB data is (or will be) stored.
    See [Connect to a database](https://lancedb.github.io/lancedb/basic/#connect-to-a-database) in the LanceDB documentation.
  - The name of the target [LanceDB table](https://lancedb.github.io/lancedb/basic/#create-an-empty-table) within the local data folder.

- For LanceDB OSS with data storage in an Amazon S3 bucket:

  - The URI for the target Amazon S3 bucket and any target folder path within that bucket. Use the format `s3://<bucket-name>[/<folder-name>]`.
  - The name of the target [LanceDB table](https://lancedb.github.io/lancedb/guides/storage/#object-stores) within the Amazon S3 bucket.
  - The AWS access key ID and AWS secret access key for the AWS IAM entity that has access to the Amazon S3 bucket.

  For more information, see [AWS S3](https://lancedb.github.io/lancedb/guides/storage/#aws-s3) in the LanceDB documentation, along with the following video:

  <iframe
    width="560"
    height="315"
    src="https://www.youtube.com/embed/hyDHfhVVAhs"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
    allowfullscreen
  ></iframe>

- For LanceDB OSS with data storage in an Azure Blob Storage account:

  - The name of the target Azure Blob Storage account.
  - The URI for the target container within that Azure Blob Storage account and any target folder path within that container. Use the format `az://<container-name>[/<folder-name>]`.
  - The name of the target [LanceDB table](https://lancedb.github.io/lancedb/guides/storage/#object-stores) within the Azure Blob Storage account.
  - The access key for the Azure Blob Storage account.

  For more information, see [Azure Blob Storage](https://lancedb.github.io/lancedb/guides/storage/#azure-blob-storage) in the LanceDB documentation, along with the following video:

  <iframe
    width="560"
    height="315"
    src="https://www.youtube.com/embed/Vl3KCphlh9Y"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
    allowfullscreen
  ></iframe>

- For LanceDB OSS with data storage in a Google Cloud Storage bucket:

  - The URI for the target Google Cloud Storage bucket and any target folder path within that bucket. Use the format `gs://<bucket-name>[/<folder-name>]`.
  - The name of the target [LanceDB table](https://lancedb.github.io/lancedb/guides/storage/#object-stores) within the Google Cloud Storage bucket.
  - A single-line string that contains the contents of the downloaded service account key file for the Google Cloud service account that has access to the
    Google Cloud Storage bucket.

  For more information, see [Google Cloud Storage](https://lancedb.github.io/lancedb/guides/storage/#google-cloud-storage) in the LanceDB documentation, along with the following video:

  <iframe
    width="560"
    height="315"
    src="https://www.youtube.com/embed/HYaALQ0F-L4"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
    allowfullscreen
  ></iframe>
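
The single-line service account key string mentioned above could be produced from the downloaded key file along these lines. This is a minimal sketch using only the Python standard library. The function name `key_file_to_single_line` and the file path in the usage comment are hypothetical, and the sketch assumes the key file is a valid JSON document.

```python
import json
from pathlib import Path


def key_file_to_single_line(key_file):
    """Read a JSON key file and return its contents as one line of compact JSON."""
    data = json.loads(Path(key_file).read_text(encoding="utf-8"))
    # json.dumps without indent emits a single line; newlines inside values
    # (such as a private key) are written as escaped "\n" sequences.
    return json.dumps(data, separators=(",", ":"))


# Hypothetical usage: print the string so it can be stored in an environment variable.
# print(key_file_to_single_line("/path/to/service-account-key.json"))
```

Re-serializing the parsed JSON, rather than stripping newlines from the raw text, keeps the escaped line breaks inside the private key value intact.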
