Commit ac30268

chore: update sphinx ingest docs with new connectors (#2245)
Replacing #2243
1 parent da7ac62 commit ac30268

159 files changed: +2942 -2079 lines changed

docs/source/ingest/destination_connectors.rst

Lines changed: 6 additions & 1 deletion
@@ -8,9 +8,14 @@ in our community `Slack. <https://short.unstructured.io/pzw05l7>`_
 .. toctree::
    :maxdepth: 1
 
+   destination_connectors/azure
    destination_connectors/azure_cognitive_search
+   destination_connectors/box
    destination_connectors/delta_table
+   destination_connectors/dropbox
+   destination_connectors/gcs
    destination_connectors/mongodb
-   destination_connectors/weaviate
    destination_connectors/pinecone
    destination_connectors/s3
+   destination_connectors/weaviate
+
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+Azure
+===========
+
+Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to an Azure bucket.
+
+First you'll need to install the Azure dependencies as shown here.
+
+.. code:: shell
+
+  pip install "unstructured[azure]"
+
+Run Locally
+-----------
+The upstream connector can be any of the ones supported, but for convenience here, showing a sample command using the
+upstream local connector.
+
+.. tabs::
+
+   .. tab:: Shell
+
+      .. literalinclude:: ./code/bash/azure.sh
+         :language: bash
+
+   .. tab:: Python
+
+      .. literalinclude:: ./code/python/azure.py
+         :language: python
+
+
+For a full list of the options the CLI accepts check ``unstructured-ingest <upstream connector> azure --help``.
+
+NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.

docs/source/ingest/destination_connectors/azure_cognitive_search.rst

Lines changed: 6 additions & 47 deletions
@@ -12,60 +12,19 @@ First you'll need to install the azure cognitive search dependencies as shown he
 Run Locally
 -----------
 The upstream connector can be any of the ones supported, but for convenience here, showing a sample command using the
-upstream s3 connector.
+upstream local connector.
 
 .. tabs::
 
    .. tab:: Shell
 
-      .. code:: shell
-
-         unstructured-ingest \
-           s3 \
-           --remote-url s3://utic-dev-tech-fixtures/small-pdf-set/ \
-           --anonymous \
-           --output-dir s3-small-batch-output-to-azure \
-           --num-processes 2 \
-           --verbose \
-           --strategy fast \
-           azure-cognitive-search \
-           --key "$AZURE_SEARCH_API_KEY" \
-           --endpoint "$AZURE_SEARCH_ENDPOINT" \
-           --index utic-test-ingest-fixtures-output
+      .. literalinclude:: ./code/bash/azure_cognitive_search.sh
+         :language: bash
 
    .. tab:: Python
 
-      .. code:: python
-
-         import os
-         import subprocess
-
-         command = [
-             "unstructured-ingest",
-             "s3",
-             "--remote-url", "s3://utic-dev-tech-fixtures/small-pdf-set/",
-             "--anonymous",
-             "--output-dir", "s3-small-batch-output-to-azure",
-             "--num-processes", "2",
-             "--verbose",
-             "--strategy", "fast",
-             "azure-cognitive-search",
-             "--key", os.getenv("AZURE_SEARCH_API_KEY"),
-             "--endpoint", os.getenv("$AZURE_SEARCH_ENDPOINT"),
-             "--index", "utic-test-ingest-fixtures-output",
-         ]
-
-         # Run the command
-         process = subprocess.Popen(command, stdout=subprocess.PIPE)
-         output, error = process.communicate()
-
-         # Print output
-         if process.returncode == 0:
-             print("Command executed successfully. Output:")
-             print(output.decode())
-         else:
-             print("Command failed. Error:")
-             print(error.decode())
+      .. literalinclude:: ./code/python/azure_cognitive_search.py
+         :language: python
 
 
 For a full list of the options the CLI accepts check ``unstructured-ingest <upstream connector> azure-cognitive-search --help``.
@@ -77,7 +36,7 @@ Sample Index Schema
 
 To make sure the schema of the index matches the data being written to it, a sample schema json can be used:
 
-.. literalinclude:: azure_cognitive_sample_index_schema.json
+.. literalinclude:: ./data/azure_cognitive_sample_index_schema.json
    :language: json
    :linenos:
    :caption: Object description
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+Box
+===========
+
+Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to a Box folder.
+
+First you'll need to install the Box dependencies as shown here.
+
+.. code:: shell
+
+  pip install "unstructured[box]"
+
+Run Locally
+-----------
+The upstream connector can be any of the ones supported, but for convenience here, showing a sample command using the
+upstream local connector.
+
+.. tabs::
+
+   .. tab:: Shell
+
+      .. literalinclude:: ./code/bash/box.sh
+         :language: bash
+
+   .. tab:: Python
+
+      .. literalinclude:: ./code/python/box.py
+         :language: python
+
+
+For a full list of the options the CLI accepts check ``unstructured-ingest <upstream connector> box --help``.
+
+NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}
+
+unstructured-ingest \
+  local \
+  --input-path example-docs/book-war-and-peace-1225p.txt \
+  --output-dir local-output-to-azure \
+  --strategy fast \
+  --chunk-elements \
+  --embedding-provider "$EMBEDDING_PROVIDER" \
+  --num-processes 2 \
+  --verbose \
+  azure \
+  --account-name azureunstructured1 \
+  --remote-url "<your destination path here, ie 'az://unstructured/war-and-peace-output'>"
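
The Python tab of the Azure page includes ``./code/python/azure.py``, whose contents are not shown in this view. The following is a minimal sketch of what an equivalent wrapper could look like, assuming it invokes the same CLI through ``subprocess`` as the inline Python example removed from the Azure Cognitive Search page did. The names ``build_azure_command`` and ``run_ingest`` and their parameters are hypothetical; the flags and values are copied from the shell script above.

```python
import subprocess


def build_azure_command(
    remote_url,
    account_name="azureunstructured1",
    embedding_provider="langchain-huggingface",
):
    """Return the argv list for the CLI call made by the shell script (hypothetical helper)."""
    return [
        "unstructured-ingest",
        # Upstream (source) connector and its options.
        "local",
        "--input-path", "example-docs/book-war-and-peace-1225p.txt",
        "--output-dir", "local-output-to-azure",
        "--strategy", "fast",
        "--chunk-elements",
        "--embedding-provider", embedding_provider,
        "--num-processes", "2",
        "--verbose",
        # Destination connector and its options.
        "azure",
        "--account-name", account_name,
        "--remote-url", remote_url,
    ]


def run_ingest(command):
    """Run the command, capturing stdout and stderr as text (hypothetical helper)."""
    # check=False: a non-zero exit is reported through returncode instead of raising.
    return subprocess.run(command, capture_output=True, text=True, check=False)
```

A caller would pass its own destination, for example ``run_ingest(build_azure_command("az://unstructured/war-and-peace-output"))``, and then inspect ``returncode``, ``stdout`` and ``stderr`` on the result. Capturing both streams avoids the situation in the removed example, where only ``stdout`` was piped and the error value was therefore ``None``.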
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}
+
+unstructured-ingest \
+  local \
+  --input-path example-docs/book-war-and-peace-1225p.txt \
+  --output-dir local-output-to-azure-cog-search \
+  --strategy fast \
+  --chunk-elements \
+  --embedding-provider "$EMBEDDING_PROVIDER" \
+  --num-processes 2 \
+  --verbose \
+  azure-cognitive-search \
+  --key "$AZURE_SEARCH_API_KEY" \
+  --endpoint "$AZURE_SEARCH_ENDPOINT" \
+  --index utic-test-ingest-fixtures-output
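
This script relies on ``AZURE_SEARCH_API_KEY`` and ``AZURE_SEARCH_ENDPOINT`` being set in the environment. The inline Python example removed by this commit looked up the endpoint with ``os.getenv("$AZURE_SEARCH_ENDPOINT")``; the ``$`` prefix is shell syntax, so in Python that lookup would normally return ``None``. Below is a minimal sketch of a stricter lookup for a Python wrapper. ``require_env`` and ``azure_cognitive_search_args`` are hypothetical helpers, not part of ``unstructured``; the flag names and the default index come from the script above.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Covers both an unset and an empty variable.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def azure_cognitive_search_args(index="utic-test-ingest-fixtures-output"):
    """Return the destination portion of the CLI call (hypothetical helper)."""
    return [
        "azure-cognitive-search",
        # Variable names are passed without the shell's `$` prefix.
        "--key", require_env("AZURE_SEARCH_API_KEY"),
        "--endpoint", require_env("AZURE_SEARCH_ENDPOINT"),
        "--index", index,
    ]
```

Failing early keeps a ``None`` value out of the argument list, where ``subprocess`` would otherwise reject it with a less descriptive error.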
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}
+
+unstructured-ingest \
+  local \
+  --input-path example-docs/book-war-and-peace-1225p.txt \
+  --output-dir local-output-to-box \
+  --strategy fast \
+  --chunk-elements \
+  --embedding-provider "$EMBEDDING_PROVIDER" \
+  --num-processes 2 \
+  --verbose \
+  box \
+  --box_app_config "$BOX_APP_CONFIG_PATH" \
+  --remote-url "<your destination path here, ie 'box://unstructured/war-and-peace-output'>"
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+#!/usr/bin/env bash
+
+EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}
+
+unstructured-ingest \
+  local \
+  --input-path example-docs/book-war-and-peace-1225p.txt \
+  --output-dir local-output-to-delta-table \
+  --strategy fast \
+  --chunk-elements \
+  --embedding-provider "$EMBEDDING_PROVIDER" \
+  --num-processes 2 \
+  --verbose \
+  delta-table \
+  --table-uri delta-table-dest
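
Each of these scripts begins with ``EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}``, which keeps a value already present in the environment and falls back to ``langchain-huggingface`` when the variable is unset or empty. For a Python wrapper, a rough equivalent might look like the sketch below; ``resolve_embedding_provider`` and ``DEFAULT_EMBEDDING_PROVIDER`` are hypothetical names introduced for illustration.

```python
import os

DEFAULT_EMBEDDING_PROVIDER = "langchain-huggingface"


def resolve_embedding_provider(environ=None):
    """Mimic the shell's ${EMBEDDING_PROVIDER:-default} expansion (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # `:-` treats an empty value like an unset one, so `or` is used here
    # rather than a .get() default, which would keep the empty string.
    return environ.get("EMBEDDING_PROVIDER") or DEFAULT_EMBEDDING_PROVIDER
```

Accepting the mapping as a parameter is a design choice for testability: callers can pass a plain dictionary instead of modifying the process environment.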
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}
+
+unstructured-ingest \
+  local \
+  --input-path example-docs/book-war-and-peace-1225p.txt \
+  --output-dir local-output-to-dropbox \
+  --strategy fast \
+  --chunk-elements \
+  --embedding-provider "$EMBEDDING_PROVIDER" \
+  --num-processes 2 \
+  --verbose \
+  dropbox \
+  --token "$DROPBOX_TOKEN" \
+  --remote-url "<your destination path here, ie 'dropbox://unstructured/war-and-peace-output'>"
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}
+
+unstructured-ingest \
+  local \
+  --input-path example-docs/book-war-and-peace-1225p.txt \
+  --output-dir local-output-to-gcs \
+  --strategy fast \
+  --chunk-elements \
+  --embedding-provider "$EMBEDDING_PROVIDER" \
+  --num-processes 2 \
+  --verbose \
+  gcs \
+  --service-account-key "$SERVICE_ACCOUNT_KEY" \
+  --remote-url "<your destination path here, ie 'gcs://unstructured/war-and-peace-output'>"
