Commit c3ce895

Ingest v2: Vectara destination connector (#424)
1 parent 115c954 commit c3ce895

10 files changed

+150
-34
lines changed


api-reference/ingest/destination-connector/vectara.mdx

Lines changed: 25 additions & 2 deletions
@@ -2,6 +2,29 @@
 title: Vectara
 ---
 
-import SharedVectara from '/snippets/dc-shared-text/vectara.mdx';
+import NewDocument from '/snippets/general-shared-text/new-document.mdx';
+
+<NewDocument />
+
+import SharedContentVectara from '/snippets/dc-shared-text/vectara-cli-api.mdx';
+import SharedAPIKeyURL from '/snippets/general-shared-text/api-key-url.mdx';
+
+<SharedContentVectara/>
+<SharedAPIKeyURL/>
+
+Now call the Unstructured CLI or Python SDK. The source connector can be any of the ones supported.
+
+This example uses the local source connector:
+
+import VectaraAPISh from '/snippets/destination_connectors/vectara.sh.mdx';
+import VectaraAPIPyV2 from '/snippets/destination_connectors/vectara.v2.py.mdx';
+import VectaraAPIPyV1 from '/snippets/destination_connectors/vectara.v1.py.mdx';
+
+<CodeGroup>
+<VectaraAPISh />
+<VectaraAPIPyV2 />
+<VectaraAPIPyV1 />
+</CodeGroup>
+
+
 
-<SharedVectara />

api-reference/ingest/ingest-dependencies.mdx

Lines changed: 1 addition & 0 deletions
@@ -87,6 +87,7 @@ To add support for additional connectors, run the following:
 | `pip install "unstructured-ingest[snowflake]"` | Snowflake |
 | `pip install "unstructured-ingest[sftp]"` | SFTP |
 | `pip install "unstructured-ingest[slack]"` | Slack |
+| `pip install "unstructured-ingest[vectara]"` | Vectara |
 | `pip install "unstructured-ingest[wikipedia]"` | Wikipedia |
 | `pip install "unstructured-ingest[weaviate]"` | Weaviate |

open-source/ingest/destination-connectors/vectara.mdx

Lines changed: 24 additions & 2 deletions
@@ -2,6 +2,28 @@
 title: Vectara
 ---
 
-import SharedVectara from '/snippets/dc-shared-text/vectara.mdx';
+<NewDocument />
 
-<SharedVectara />
+import SharedContentVectara from '/snippets/dc-shared-text/vectara-cli-api.mdx';
+
+<SharedContentVectara/>
+
+Now call the Unstructured CLI or Python SDK. The source connector can be any of the ones supported.
+
+This example uses the local source connector.
+
+This example sends files to Unstructured API services for processing by default. To process files locally instead, see the instructions at the end of this page.
+
+import VectaraAPISh from '/snippets/destination_connectors/vectara.sh.mdx';
+import VectaraAPIPyV2 from '/snippets/destination_connectors/vectara.v2.py.mdx';
+import VectaraAPIPyV1 from '/snippets/destination_connectors/vectara.v1.py.mdx';
+
+<CodeGroup>
+<VectaraAPISh />
+<VectaraAPIPyV2 />
+<VectaraAPIPyV1 />
+</CodeGroup>
+
+import SharedPartitionByAPIOSS from '/snippets/ingest-configuration-shared/partition-by-api-oss.mdx';
+
+<SharedPartitionByAPIOSS/>
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+Batch process all your records to store structured outputs in Vectara.
+
+The requirements are as follows.
+
+import SharedVectara from '/snippets/general-shared-text/vectara.mdx';
+import SharedVectaraCLIAPI from '/snippets/general-shared-text/vectara-cli-api.mdx';
+
+<SharedVectara />
+<SharedVectaraCLIAPI />

snippets/dc-shared-text/vectara.mdx

Lines changed: 0 additions & 19 deletions
This file was deleted.
Lines changed: 12 additions & 9 deletions
@@ -1,19 +1,22 @@
-```bash Shell
+```bash CLI
 #!/usr/bin/env bash
 
-# Chunking is optional.
+# Chunking and embedding are optional.
 
 unstructured-ingest \
   local \
     --input-path $LOCAL_FILE_INPUT_DIR \
-    --output-dir $LOCAL_FILE_OUTPUT_DIR \
-    --strategy hi_res \
-    --chunk-elements \
-    --num-processes 2 \
-    --verbose \
+    --chunking-strategy by_title \
+    --embedding-provider huggingface \
+    --partition-by-api \
+    --api-key $UNSTRUCTURED_API_KEY \
+    --partition-endpoint $UNSTRUCTURED_API_URL \
+    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
   vectara \
     --customer-id $VECTARA_CUSTOMER_ID \
+    --corpus-name $VECTARA_CORPUS_NAME \
+    --corpus-key $VECTARA_CORPUS_KEY \
     --oauth-client-id $VECTARA_OAUTH_CLIENT_ID \
-    --oauth-secret $VECTARA_OAUTH_SECRET \
-    --corpus-name test-corpus-vectara
+    --oauth-secret $VECTARA_OAUTH_CLIENT_SECRET \
+    --token-url $VECTARA_OAUTH_TOKEN_URL
 ```
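
The hand-escaped JSON passed to `--additional-partition-args` in the CLI example above is easy to get wrong. A minimal sketch of one way to generate that flag from Python follows. The helper name `build_partition_args_flag` is hypothetical and not part of `unstructured-ingest`; the argument values are copied from the CLI example.

```python
import json
import shlex

# Partition arguments copied from the CLI example above.
partition_args = {
    "split_pdf_page": "true",
    "split_pdf_allow_failed": "true",
    "split_pdf_concurrency_level": 15,
}


def build_partition_args_flag(args: dict) -> str:
    """Hypothetical helper: serialize args to JSON and quote them for a POSIX shell."""
    return "--additional-partition-args=" + shlex.quote(json.dumps(args))


flag = build_partition_args_flag(partition_args)
print(flag)
```

The printed flag wraps the JSON in single quotes rather than escaping each double quote, which a POSIX shell parses to the same single argument.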

snippets/destination_connectors/vectara.py.mdx renamed to snippets/destination_connectors/vectara.v1.py.mdx

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-```python Python
+```python Python Ingest v1
 import os
 
 from unstructured_ingest.connector.local import SimpleLocalConfig
@@ -24,7 +24,7 @@ def get_writer() -> Writer:
         connector_config=SimpleVectaraConfig(
             access_config=VectaraAccessConfig(
                 oauth_client_id=os.getenv("VECTARA_OAUTH_CLIENT_ID"),
-                oauth_secret=os.getenv("VECTARA_OAUTH_SECRET"),
+                oauth_secret=os.getenv("VECTARA_OAUTH_CLIENT_SECRET"),
             ),
             customer_id=os.getenv("VECTARA_CUSTOMER_ID"),
             corpus_name="test-corpus-vectara",
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+```python Python Ingest v2
+import os
+
+from unstructured_ingest.v2.pipeline.pipeline import Pipeline
+from unstructured_ingest.v2.interfaces import ProcessorConfig
+
+from unstructured_ingest.v2.processes.connectors.vectara import (
+    VectaraAccessConfig,
+    VectaraConnectionConfig,
+    VectaraUploadStagerConfig,
+    VectaraUploaderConfig
+)
+from unstructured_ingest.v2.processes.connectors.local import (
+    LocalIndexerConfig,
+    LocalConnectionConfig,
+    LocalDownloaderConfig
+)
+from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
+from unstructured_ingest.v2.processes.chunker import ChunkerConfig
+from unstructured_ingest.v2.processes.embedder import EmbedderConfig
+
+# Chunking and embedding are optional.
+
+if __name__ == "__main__":
+    Pipeline.from_configs(
+        context=ProcessorConfig(),
+        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
+        downloader_config=LocalDownloaderConfig(),
+        source_connection_config=LocalConnectionConfig(),
+        partitioner_config=PartitionerConfig(
+            partition_by_api=True,
+            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
+            additional_partition_args={
+                "split_pdf_page": True,
+                "split_pdf_allow_failed": True,
+                "split_pdf_concurrency_level": 15
+            }
+        ),
+        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
+        embedder_config=EmbedderConfig(embedding_provider="huggingface"),
+        destination_connection_config=VectaraConnectionConfig(
+            access_config=VectaraAccessConfig(
+                oauth_client_id=os.getenv("VECTARA_OAUTH_CLIENT_ID"),
+                oauth_secret=os.getenv("VECTARA_OAUTH_CLIENT_SECRET")
+            ),
+            customer_id=os.getenv("VECTARA_CUSTOMER_ID"),
+            corpus_name=os.getenv("VECTARA_CORPUS_NAME"),
+            corpus_key=os.getenv("VECTARA_CORPUS_KEY"),
+            token_url=os.getenv("VECTARA_OAUTH_TOKEN_URL")
+        ),
+        stager_config=VectaraUploadStagerConfig(),
+        uploader_config=VectaraUploaderConfig()
+    ).run()
+```
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+The Vectara connector dependencies.
+
+```bash
+pip install "unstructured-ingest[vectara]"
+```
+
+import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-dependencies.mdx';
+
+<AdditionalIngestDependencies />
+
+The following environment variables:
+
+- `VECTARA_CUSTOMER_ID` - The customer ID for the target Vectara account, represented by `--customer-id` (CLI) or `customer_id` (Python).
+- `VECTARA_CORPUS_NAME` - The name of the target corpus in the account, represented by `--corpus-name` (CLI) or `corpus_name` (Python).
+- `VECTARA_CORPUS_KEY` - The key of the target corpus, represented by `--corpus-key` (CLI) or `corpus_key` (Python).
+- `VECTARA_OAUTH_TOKEN_URL` - The OAuth token URL for getting and refreshing OAuth access tokens in the account, represented by `--token-url` (CLI) or `token_url` (Python).
+- `VECTARA_OAUTH_CLIENT_ID` - A valid OAuth client ID in the account, represented by `--oauth-client-id` (CLI) or `oauth_client_id` (Python).
+- `VECTARA_OAUTH_CLIENT_SECRET` - The OAuth client secret for the client ID, represented by `--oauth-secret` (CLI) or `oauth_secret` (Python).
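
Because both the CLI and Python examples read these values from the environment, an unset variable tends to surface only as a failure partway through a run. A minimal sketch of a pre-flight check follows. The helper name `missing_vectara_vars` is hypothetical; the variable names are the ones listed above.

```python
import os
from typing import List, Mapping, Optional

# Environment variables named in the list above.
REQUIRED_VARS = (
    "VECTARA_CUSTOMER_ID",
    "VECTARA_CORPUS_NAME",
    "VECTARA_CORPUS_KEY",
    "VECTARA_OAUTH_TOKEN_URL",
    "VECTARA_OAUTH_CLIENT_ID",
    "VECTARA_OAUTH_CLIENT_SECRET",
)


def missing_vectara_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Hypothetical helper: return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vectara_vars()
print("Missing variables:", ", ".join(missing) if missing else "none")
```

Calling this before the pipeline starts reports every missing variable in one pass instead of failing on the first one.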
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+- A [Vectara account](https://console.vectara.com/signup).
+- The [customer ID](https://docs.vectara.com/docs/console-ui/vectara-console-overview#view-the-customer-id) for the account.
+- The name and key for the target [corpus](https://docs.vectara.com/docs/console-ui/creating-a-corpus) in the account.
+- The [OAuth authentication URL, client ID, and client secret](https://docs.vectara.com/docs/console-ui/app-clients) for accessing the target corpus.
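
To illustrate how the OAuth authentication URL, client ID, and client secret relate to each other, here is a minimal sketch that builds (but does not send) a standard OAuth 2.0 client-credentials token request. It assumes the token endpoint accepts the client-credentials grant as form-encoded fields; this is not taken from the connector's own code, and `build_token_request` and the example URL are hypothetical.

```python
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Hypothetical helper: build an OAuth 2.0 client-credentials token request."""
    # Assumption: the endpoint expects the credentials as form-encoded body fields.
    body = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }
    ).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values; substitute the URL and credentials from your own account.
request = build_token_request("https://example.com/oauth2/token", "my-client-id", "my-client-secret")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The connector is expected to handle token retrieval itself, so this sketch is only useful for checking credentials by hand.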
