
Commit c35fff2

feat: Add stage_for_weaviate and schema creation function (#672)

* add weaviate docker compose
* added staging brick and tests for weaviate
* initial notebook and requirements file
* add commentary to weaviate notebook
* weaviate readme
* update docs
* version and change log
* install weaviate client
* install weaviate; skip for docker
* linting, linting, linting
* install weaviate client with deps
* comments on weaviate client
* fix module not found error for docker container
* skipped wrong test in docker
* fix typos
* add in local-inference

1 parent cf70c86

File tree

11 files changed: +455 −3 lines changed

.github/workflows/ci.yml

Lines changed: 3 additions & 0 deletions

@@ -138,6 +138,9 @@ jobs:
       sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
       sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
       tesseract --version
+      # NOTE(robinson) - Installing weaviate-client separately here because the requests
+      # version conflicts with label_studio_sdk
+      pip install weaviate-client
      make test
      make check-coverage

CHANGELOG.md

Lines changed: 4 additions & 2 deletions

@@ -2,10 +2,12 @@

 ### Enhancements

-* Builds from Unstructured base image, built off of Rocky Linux 8.7, this resolves almost all CVE's in the image.
-
 ### Features

+* Add `stage_for_weaviate` to stage `unstructured` outputs for upload to Weaviate, along with
+  a helper function for defining a class to use in Weaviate schemas.
+* Builds from Unstructured base image, built off of Rocky Linux 8.7, this resolves almost all CVE's in the image.
+
 ### Fixes

 ## 0.7.0

Makefile

Lines changed: 4 additions & 1 deletion

@@ -41,6 +41,9 @@ install-nltk-models:
 .PHONY: install-test
 install-test:
 	python3 -m pip install -r requirements/test.txt
+	# NOTE(robinson) - Installing weaviate-client separately here because the requests
+	# version conflicts with label_studio_sdk
+	python3 -m pip install weaviate-client

 .PHONY: install-dev
 install-dev:

@@ -245,4 +248,4 @@ docker-jupyter-notebook:

 .PHONY: run-jupyter
 run-jupyter:
-	PYTHONPATH=$(realpath .) JUPYTER_PATH=$(realpath .) jupyter-notebook --NotebookApp.token='' --NotebookApp.password=''
+	PYTHONPATH=$(realpath .) JUPYTER_PATH=$(realpath .) jupyter-notebook --NotebookApp.token='' --NotebookApp.password=''

docs/source/bricks.rst

Lines changed: 52 additions & 0 deletions

@@ -1554,6 +1554,58 @@ See the `LabelStudio docs <https://labelstud.io/tags/labels.html>`_ for a full l
 for labels and annotations.


+``stage_for_weaviate``
+-----------------------
+
+The ``stage_for_weaviate`` staging function prepares a list of ``Element`` objects for ingestion into
+the `Weaviate <https://weaviate.io/>`_ vector database. You can create a schema in Weaviate
+for the ``unstructured`` outputs using the following workflow:
+
+.. code:: python
+
+    from unstructured.staging.weaviate import create_unstructured_weaviate_class
+
+    import weaviate
+
+    # Change `class_name` if you want the class for unstructured documents in Weaviate
+    # to have a different name
+    unstructured_class = create_unstructured_weaviate_class(class_name="UnstructuredDocument")
+    schema = {"classes": [unstructured_class]}
+
+    client = weaviate.Client("http://localhost:8080")
+    client.schema.create(schema)
+
+
+Once the schema is created, you can batch upload documents to Weaviate using the following workflow.
+See the `Weaviate documentation <https://weaviate.io/developers/weaviate>`_ for more details on
+options for uploading data and querying data once it has been uploaded.
+
+.. code:: python
+
+    import tqdm
+
+    from unstructured.partition.pdf import partition_pdf
+    from unstructured.staging.weaviate import stage_for_weaviate
+
+    import weaviate
+    from weaviate.util import generate_uuid5
+
+
+    unstructured_class_name = "UnstructuredDocument"
+    filename = "example-docs/layout-parser-paper-fast.pdf"
+    elements = partition_pdf(filename=filename, strategy="fast")
+    data_objects = stage_for_weaviate(elements)
+
+    client = weaviate.Client("http://localhost:8080")
+
+    with client.batch(batch_size=10) as batch:
+        for data_object in tqdm.tqdm(data_objects):
+            batch.add_data_object(
+                data_object,
+                unstructured_class_name,
+                uuid=generate_uuid5(data_object),
+            )
+
+
 ``stage_for_baseplate``
 -----------------------

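For orientation, the class definition that `create_unstructured_weaviate_class` returns is a plain Weaviate schema dictionary. The sketch below is hand-written and minimal: the `"class"`, `"properties"`, `"name"`, and `"dataType"` keys follow Weaviate's schema format, but the specific property list is illustrative; the real helper derives additional properties (e.g. filename, page number) from `ElementMetadata`.

```python
# A minimal, hand-written sketch of a Weaviate class definition.
# The property list here is illustrative; create_unstructured_weaviate_class
# generates more properties based on ElementMetadata fields.
def make_minimal_class(class_name="UnstructuredDocument"):
    return {
        "class": class_name,
        "properties": [
            {"name": "text", "dataType": ["text"]},      # the element's text content
            {"name": "category", "dataType": ["text"]},  # e.g. NarrativeText, Title
        ],
    }

# A Weaviate schema is a dict with a "classes" list, as in the docs above.
schema = {"classes": [make_minimal_class()]}
```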
docs/source/integrations.rst

Lines changed: 10 additions & 0 deletions

@@ -75,3 +75,13 @@ the text from each element and their types such as ``NarrativeText`` or ``Title`
 -----------------------------
 You can format your JSON or CSV outputs for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_ using the `stage_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-prodigy>`_ and `stage_csv_for_prodigy <https://unstructured-io.github.io/unstructured/bricks.html#stage-csv-for-prodigy>`_ staging bricks. After running ``stage_for_prodigy`` or
 ``stage_csv_for_prodigy``, you can write the results to a ``.json``, ``.jsonl``, or ``.csv`` file that is ready to be used with Prodigy. Follow the links for more details on usage.
+
+
+``Integration with Weaviate``
+-----------------------------
+`Weaviate <https://weaviate.io/>`_ is an open-source vector database that allows you to store data objects and vector embeddings
+from a variety of ML models. Storing text and embeddings in a vector database such as Weaviate is a key component of the
+`emerging LLM tech stack <https://medium.com/@unstructured-io/llms-and-the-emerging-ml-tech-stack-bdb189c8be5c>`_.
+See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate>`_ docs for details
+on how to upload ``unstructured`` outputs to Weaviate. An example notebook is also available
+`here <https://github.com/Unstructured-IO/unstructured/tree/main/examples/weaviate>`_.

examples/weaviate/README.md

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+## Uploading data to Weaviate with `unstructured`
+
+The example notebook in this directory shows how to upload documents to Weaviate using the
+`unstructured` library. To get started with the notebook, use the following steps:
+
+- Run `pip install -r requirements.txt` to install the requirements.
+- Run `docker-compose up` to run the Weaviate container.
+- Run `jupyter-notebook` to start the notebook.
examples/weaviate/docker-compose.yml

Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@
+version: '3.4'
+services:
+  weaviate:
+    image: semitechnologies/weaviate:1.19.6
+    restart: on-failure:0
+    ports:
+      - "8080:8080"
+    environment:
+      QUERY_DEFAULTS_LIMIT: 20
+      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
+      PERSISTENCE_DATA_PATH: "./data"
+      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
+      ENABLE_MODULES: text2vec-transformers
+      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
+      CLUSTER_HOSTNAME: 'node1'
+  t2v-transformers:
+    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
+    environment:
+      ENABLE_CUDA: 0 # set to 1 to enable
+      # NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA
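Before running the notebook against this stack, it can help to confirm the Weaviate container is actually up. Weaviate exposes a readiness endpoint at `/v1/.well-known/ready`; a stdlib-only sketch of a readiness probe (the base URL matches the port mapping above) might look like:

```python
import urllib.error
import urllib.request


def weaviate_is_ready(base_url="http://localhost:8080", timeout=2):
    """Return True if Weaviate's readiness endpoint answers with HTTP 2xx."""
    try:
        with urllib.request.urlopen(
            f"{base_url}/v1/.well-known/ready", timeout=timeout
        ) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused / timeout: the container is not ready yet.
        return False


if __name__ == "__main__":
    print("Weaviate ready:", weaviate_is_ready())
```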

examples/weaviate/requirements.txt

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+jupyter
+tqdm
+weaviate-client
+unstructured[local-inference]

examples/weaviate/weaviate.ipynb

Lines changed: 215 additions & 0 deletions

@@ -0,0 +1,215 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a3ce962e",
+   "metadata": {},
+   "source": [
+    "## Loading Data into Weaviate with `unstructured`\n",
+    "\n",
+    "This notebook shows a basic workflow for uploading document elements into Weaviate using the `unstructured` library. To get started with this notebook, first install the dependencies with `pip install -r requirements.txt` and start the Weaviate docker container with `docker-compose up`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "5d9ffc17",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "import tqdm\n",
+    "from unstructured.partition.pdf import partition_pdf\n",
+    "from unstructured.staging.weaviate import create_unstructured_weaviate_class, stage_for_weaviate\n",
+    "import weaviate\n",
+    "from weaviate.util import generate_uuid5"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "673715e9",
+   "metadata": {},
+   "source": [
+    "The first step is to partition the document using the `unstructured` library. In the following example, we partition a PDF with `partition_pdf`. You can also partition over a dozen document types with the `partition` function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "f9fc0cf9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "filename = \"../../example-docs/layout-parser-paper-fast.pdf\"\n",
+    "elements = partition_pdf(filename=filename, strategy=\"fast\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3ae76364",
+   "metadata": {},
+   "source": [
+    "Next, we'll create a schema for our Weaviate database using the `create_unstructured_weaviate_class` helper function from the `unstructured` library. The helper function generates a schema that includes all of the fields in the `ElementMetadata` object from `unstructured`. This includes information such as the filename and the page number of the document element. After specifying the schema, we create a connection to the database with the Weaviate client library and create the schema. You can change the name of the class by updating the `unstructured_class_name` variable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "91057cb1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "unstructured_class_name = \"UnstructuredDocument\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "78e804bb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "unstructured_class = create_unstructured_weaviate_class(unstructured_class_name)\n",
+    "schema = {\"classes\": [unstructured_class]}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "3e317a2d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = weaviate.Client(\"http://localhost:8080\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "0c508784",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client.schema.create(schema)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "024ae133",
+   "metadata": {},
+   "source": [
+    "Next, we stage the elements for Weaviate using the `stage_for_weaviate` function and batch upload the results to Weaviate. `stage_for_weaviate` outputs a dictionary that conforms to the schema we created earlier. Once that data is staged, we can use the Weaviate client library to batch upload the results to Weaviate."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a7018bb1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_objects = stage_for_weaviate(elements)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "af712d8e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████████████████████████████████████████████████████████████████| 28/28 [00:46<00:00, 1.66s/it]\n"
+     ]
+    }
+   ],
+   "source": [
+    "with client.batch(batch_size=10) as batch:\n",
+    "    for data_object in tqdm.tqdm(data_objects):\n",
+    "        batch.add_data_object(\n",
+    "            data_object,\n",
+    "            unstructured_class_name,\n",
+    "            uuid=generate_uuid5(data_object),\n",
+    "        )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dac10bf5",
+   "metadata": {},
+   "source": [
+    "Now that the documents are in Weaviate, we're able to run queries against Weaviate!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "14098434",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{\n",
+      "    \"data\": {\n",
+      "        \"Get\": {\n",
+      "            \"UnstructuredDocument\": [\n",
+      "                {\n",
+      "                    \"text\": \"Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classi\\ufb01cation [11,\"\n",
+      "                }\n",
+      "            ]\n",
+      "        }\n",
+      "    }\n",
+      "}\n"
+     ]
+    }
+   ],
+   "source": [
+    "near_text = {\"concepts\": [\"document understanding\"]}\n",
+    "\n",
+    "result = (\n",
+    "    client.query\n",
+    "    .get(\"UnstructuredDocument\", [\"text\"])\n",
+    "    .with_near_text(near_text)\n",
+    "    .with_limit(1)\n",
+    "    .do()\n",
+    ")\n",
+    "\n",
+    "print(json.dumps(result, indent=4))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c191217c",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
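One detail worth noting in the notebook's upload loop: `weaviate.util.generate_uuid5` derives the object ID from the data object's content, so re-running the upload produces the same IDs rather than duplicate objects. The idea can be illustrated with the standard library alone; this is a conceptual sketch, not the exact algorithm `generate_uuid5` uses, and the namespace below is an arbitrary choice for illustration.

```python
import json
import uuid

# Arbitrary fixed namespace, chosen only for this illustration.
NAMESPACE = uuid.NAMESPACE_DNS


def content_derived_id(data_object: dict) -> str:
    """Derive a deterministic UUID from the object's content.

    Serializing with sort_keys=True makes the ID independent of dict
    key order, so the same content always maps to the same UUID.
    """
    serialized = json.dumps(data_object, sort_keys=True)
    return str(uuid.uuid5(NAMESPACE, serialized))


a = content_derived_id({"text": "hello", "page_number": 1})
b = content_derived_id({"page_number": 1, "text": "hello"})  # same content, reordered keys
c = content_derived_id({"text": "goodbye", "page_number": 1})
```

Because the ID is a pure function of the content, uploading the same staged elements twice overwrites the existing objects instead of creating duplicates, which is what makes the notebook's batch upload safe to re-run.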
