
Commit 9177aa2

feature CORE-3985: add Clarifai destination connector (#2633)
Thanks to @mogith-pn from Clarifai, we have a new destination connector! This PR adds Clarifai as an ingest destination connector, accessible via the CLI and programmatically, along with documentation, examples, and an integration test script.
1 parent 469f878 commit 9177aa2

20 files changed: +793, -3 lines

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions

@@ -398,6 +398,7 @@ jobs:
 VECTARA_CUSTOMER_ID: ${{secrets.VECTARA_CUSTOMER_ID}}
 ASTRA_DB_TOKEN: ${{secrets.ASTRA_DB_TOKEN}}
 ASTRA_DB_ENDPOINT: ${{secrets.ASTRA_DB_ENDPOINT}}
+CLARIFAI_API_KEY: ${{secrets.CLARIFAI_API_KEY}}
 TABLE_OCR: "tesseract"
 OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
 CI: "true"
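
The new CLARIFAI_API_KEY secret is exposed to the ingest test job alongside the other destination credentials. As a rough sketch of reproducing that environment locally, assuming the integration test script added by this PR follows the repository's existing test_unstructured_ingest layout (the script path below is an assumption, not shown in this diff):

    # Assumed local equivalent of the CI setup for the Clarifai destination test
    export CLARIFAI_API_KEY="<your clarifai PAT key>"
    ./test_unstructured_ingest/dest/clarifai.sh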

CHANGELOG.md

Lines changed: 3 additions & 1 deletion

@@ -1,4 +1,4 @@
-## 0.12.7-dev8
+## 0.12.7-dev9
 
 ### Enhancements
 
@@ -8,6 +8,7 @@
 ### Features
 
 * **Chunking populates `.metadata.orig_elements` for each chunk.** This behavior allows the text and metadata of the elements combined to make each chunk to be accessed. This can be important for example to recover metadata such as `.coordinates` that cannot be consolidated across elements and so is dropped from chunks. This option is controlled by the `include_orig_elements` parameter to `partition_*()` or to the chunking functions. This option defaults to `True` so original-elements are preserved by default. This behavior is not yet supported via the REST APIs or SDKs but will be in a closely subsequent PR to other `unstructured` repositories. The original elements will also not serialize or deserialize yet; this will also be added in a closely subsequent PR.
+* **Add Clarifai destination connector** Adds support for writing partitioned and chunked documents into Clarifai.
 
 ### Fixes
 
@@ -24,6 +25,7 @@
 * **Redefine `table_level_acc` metric for table evaluation.** `table_level_acc` now is an average of individual predicted table's accuracy. A predicted table's accuracy is defined as the sequence matching ratio between itself and its corresponding ground truth table.
 
 ### Features
+
 * **Added Unstructured Platform Documentation** The Unstructured Platform is currently in beta. The documentation provides how-to guides for setting up workflow automation, job scheduling, and configuring source and destination connectors.
 
 ### Fixes

Makefile

Lines changed: 4 additions & 0 deletions

@@ -251,6 +251,10 @@ install-ingest-databricks-volumes:
 install-ingest-astra:
 	python3 -m pip install -r requirements/ingest/astra.txt
 
+.PHONY: install-ingest-clarifai
+install-ingest-clarifai:
+	python3 -m pip install -r requirements/ingest/clarifai.txt
+
 .PHONY: install-embed-huggingface
 install-embed-huggingface:
 	python3 -m pip install -r requirements/ingest/embed-huggingface.txt
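
With this target in place, the Clarifai ingest dependencies can be installed the same way as the other connectors:

    make install-ingest-clarifai
    # which runs:
    python3 -m pip install -r requirements/ingest/clarifai.txt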

docs/source/ingest/destination_connectors.rst

Lines changed: 1 addition & 0 deletions

@@ -13,6 +13,7 @@ in our community `Slack. <https://short.unstructured.io/pzw05l7>`_
    destination_connectors/azure_cognitive_search
    destination_connectors/box
    destination_connectors/chroma
+   destination_connectors/clarifai
    destination_connectors/databricks_volumes
    destination_connectors/delta_table
    destination_connectors/dropbox

docs/source/ingest/destination_connectors/clarifai.rst (new file)

Lines changed: 34 additions & 0 deletions

Clarifai
===========

Batch process all your records using ``unstructured-ingest`` to store unstructured outputs locally on your filesystem and upload them to a Clarifai app.

First, install the Clarifai dependencies as shown here.

.. code:: shell

   pip install "unstructured[clarifai]"

Create a Clarifai app with a base workflow. You can find more information in the `create a Clarifai app <https://docs.clarifai.com/clarifai-basics/applications/create-an-application/>`_ guide.

Run Locally
-----------
The upstream connector can be any of the ones supported; for convenience, the sample commands below use the upstream local connector.

.. tabs::

   .. tab:: Shell

      .. literalinclude:: ./code/bash/clarifai.sh
         :language: bash

   .. tab:: Python

      .. literalinclude:: ./code/python/clarifai.py
         :language: python

For a full list of the options the CLI accepts, check ``unstructured-ingest <upstream connector> clarifai --help``.

NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.
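
Following the doc above, a quick way to confirm the extra is installed and to inspect the connector's CLI options (using the local upstream connector as in the examples) is:

    pip install "unstructured[clarifai]"
    unstructured-ingest local clarifai --help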

docs/source/ingest/destination_connectors/code/bash/clarifai.sh (new file)

Lines changed: 15 additions & 0 deletions

#!/usr/bin/env bash

unstructured-ingest \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-clarifai \
  --strategy fast \
  --chunk-elements \
  --num-processes 2 \
  --verbose \
  clarifai \
  --app-id "<your clarifai app name>" \
  --user-id "<your clarifai user id>" \
  --api-key "<your clarifai PAT key>" \
  --batch-size 100
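
A small variation on the sample above keeps the PAT out of your shell history by passing it through an environment variable; the variable name here is only illustrative:

    # Illustrative: same command as the doc sample, PAT supplied via a shell variable
    export CLARIFAI_PAT="<your clarifai PAT key>"
    unstructured-ingest \
      local \
      --input-path example-docs/book-war-and-peace-1225p.txt \
      --output-dir local-output-to-clarifai \
      --strategy fast \
      --chunk-elements \
      --num-processes 2 \
      --verbose \
      clarifai \
      --app-id "<your clarifai app name>" \
      --user-id "<your clarifai user id>" \
      --api-key "$CLARIFAI_PAT" \
      --batch-size 100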

docs/source/ingest/destination_connectors/code/python/clarifai.py (new file)

Lines changed: 48 additions & 0 deletions

from unstructured.ingest.connector.clarifai import (
    ClarifaiAccessConfig,
    ClarifaiWriteConfig,
    SimpleClarifaiConfig,
)
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.clarifai import (
    ClarifaiWriter,
)


def get_writer() -> Writer:
    return ClarifaiWriter(
        connector_config=SimpleClarifaiConfig(
            access_config=ClarifaiAccessConfig(api_key="CLARIFAI_PAT"),
            app_id="CLARIFAI_APP",
            user_id="CLARIFAI_USER_ID",
        ),
        write_config=ClarifaiWriteConfig(),
    )


if __name__ == "__main__":
    writer = get_writer()
    runner = LocalRunner(
        processor_config=ProcessorConfig(
            verbose=True,
            output_dir="local-output-to-clarifai-app",
            num_processes=2,
        ),
        connector_config=SimpleLocalConfig(
            input_path="example-docs/book-war-and-peace-1225p.txt",
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(),
        chunking_config=ChunkingConfig(chunk_elements=True),
        writer=writer,
        writer_kwargs={},
    )
    runner.run()
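
One way to try the Python example (the file name and steps are illustrative, not part of the commit): save it as clarifai.py in a checkout of the repository so the example document path resolves, install the extra, and run it:

    pip install "unstructured[clarifai]"
    python clarifai.py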

docs/source/introduction/key_concepts.rst

Lines changed: 1 addition & 1 deletion

@@ -66,7 +66,7 @@ A RAG workflow can be broken down into the following steps:
 
 4. **Embedding**: After chunking, you must convert the text into a numerical representation (vector embedding) that an LLM can understand. To use the various embedding models using Unstructured tools, please refer to `this page <https://unstructured-io.github.io/unstructured/core/embedding.html>`__.
 
-5. **Vector Database**: The next step is to choose a location for storing your chunked embeddings. There are many options for your vector database (ChromaDB, Milvus, Pinecone, Qdrant, Weaviate, and more). For complete list of Unstructured ``Destination Connectors``, please visit `this page <https://unstructured-io.github.io/unstructured/ingest/destination_connectors.html>`__.
+5. **Vector Database**: The next step is to choose a location for storing your chunked embeddings. There are many options for your vector database (AstraDB, ChromaDB, Clarifai, Milvus, Pinecone, Qdrant, Weaviate, and more). For complete list of Unstructured ``Destination Connectors``, please visit `this page <https://unstructured-io.github.io/unstructured/ingest/destination_connectors.html>`__.
 
 6. **User Prompt**: Take the user prompt and grab the most relevant chunks of information in the vector database via similarity search.
 

examples/ingest/clarifai/ingest.sh

Lines changed: 20 additions & 0 deletions

#!/usr/bin/env bash

# Uploads the structured output of the files within the given path to a clarifai app.

SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
cd "$SCRIPT_DIR"/../../.. || exit 1

PYTHONPATH=. ./unstructured/ingest/main.py \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-clarifai \
  --strategy fast \
  --chunk-elements \
  --num-processes 2 \
  --verbose \
  clarifai \
  --app-id "<your clarifai app name>" \
  --user-id "<your clarifai user id>" \
  --api-key "<your clarifai PAT key>" \
  --batch-size 100
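
Because the script resolves its own location and changes to the repository root before calling the ingest entry point, it can be invoked by path once the Clarifai credentials are filled in (assuming a repository checkout with the ingest dependencies installed):

    bash examples/ingest/clarifai/ingest.sh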

requirements/ingest/clarifai.in

Lines changed: 3 additions & 0 deletions

-c ../constraints.in
-c ../base.txt
clarifai
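
This .in file is the source for the pinned requirements/ingest/clarifai.txt that the new Make target installs. A minimal sketch of regenerating the pinned file with pip-tools (the repository may use its own compile workflow; this is just the generic pip-compile invocation):

    pip install pip-tools
    pip-compile --output-file requirements/ingest/clarifai.txt requirements/ingest/clarifai.in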
