
Commit 74dfc2e

Weaviate destination connector: metadata field guidance for Weaviate collection schemas (#823)
1 parent 0b7e106 commit 74dfc2e

1 file changed: +178 -65 lines changed

snippets/general-shared-text/weaviate.mdx

Lines changed: 178 additions & 65 deletions
@@ -39,7 +39,7 @@
 You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
 - If a collection name is not specified, Unstructured creates a new collection in your Weaviate cluster. The new collection's name will be `Unstructuredautocreated`.
 
-If Unstructured creates a new collection and generates embeddings, you will not see an embeddings property in tools such as the Weaviate Cloud
+If Unstructured creates a new collection and generates embeddings, you will not see an `embeddings` property in tools such as the Weaviate Cloud
 **Collections** user interface. To view the generated embeddings, you can run a Weaviate GraphQL query such as the following. In this query, replace `<collection-name>` with
 the name of the new collection, and replace `<property-name>` with the name of each additional available property that
 you want to return results for, such as `text`, `type`, `element_id`, `record_id`, and so on. The embeddings will be
@@ -59,81 +59,194 @@
 }
 ```
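
The query itself falls outside this hunk, so the following is only a rough sketch of the kind of query the paragraph describes. It assembles the text of a Weaviate GraphQL `Get` query for a collection and a list of property names, and asks for each object's stored vector through the `_additional { vector }` field. The helper name `build_vector_query` is hypothetical and is not part of Unstructured or `weaviate-client`. It only builds the query string, which you would then run in whichever GraphQL tool you use.

```python
def build_vector_query(collection: str, properties: list[str]) -> str:
    """Return the text of a GraphQL Get query that also requests each object's vector."""
    # One indented line per requested property, e.g. "      text".
    prop_lines = "\n".join(f"      {name}" for name in properties)
    return (
        "{\n"
        "  Get {\n"
        f"    {collection} {{\n"
        "      _additional {\n"
        "        vector\n"
        "      }\n"
        f"{prop_lines}\n"
        "    }\n"
        "  }\n"
        "}"
    )


# "MyCollection" and the property names below are placeholders for illustration.
query = build_vector_query("MyCollection", ["text", "type", "element_id", "record_id"])
print(query)
```

Keeping the query as plain text makes it easy to paste into a console and adjust the property list by hand.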
 
-Weaviate requires an existing collection to have a data schema before you add data. At minimum, this schema must contain the `record_id` property, as follows:
+If [auto-schema](https://docs.weaviate.io/weaviate/config-refs/collections#auto-schema) is enabled in Weaviate (which it is by default),
+Weaviate can infer missing properties and add them to the collection definition at run time. However, it is a Weaviate best practice to manually define as much
+of the data schema in advance as possible, since manual definition gives you the most control.
+
+The minimum viable schema for Unstructured includes only the `element_id` and `record_id` properties. The `text` and `type` properties should also be included, but they are technically optional.
+If you are using Unstructured to generate embeddings, you must
+
+The following code example shows how to use the [weaviate-client](https://pypi.org/project/weaviate-client/) Python package to create a
+collection in a Weaviate Cloud database cluster with this minimum viable schema, and to specify that Unstructured will generate the embeddings for this collection.
+To connect to a locally hosted Weaviate instance instead, call [weaviate.connect_to_local](https://docs.weaviate.io/weaviate/connections/connect-local).
+To connect to Embedded Weaviate instead, call [weaviate.connect_to_embedded](https://docs.weaviate.io/weaviate/connections/connect-embedded).
+
+```python
+import os
+import weaviate
+from weaviate.classes.init import Auth
+import weaviate.classes.config as wvc
+
+client = weaviate.connect_to_weaviate_cloud(
+    cluster_url=os.getenv("WEAVIATE_URL"),
+    auth_credentials=Auth.api_key(api_key=os.getenv("WEAVIATE_API_KEY")),
+)
+
+collection = client.collections.create(
+    name="MyCollection",
+    properties=[
+        wvc.Property(name="element_id", data_type=wvc.DataType.UUID),
+        wvc.Property(name="record_id", data_type=wvc.DataType.TEXT),
+        wvc.Property(name="text", data_type=wvc.DataType.TEXT),
+        wvc.Property(name="type", data_type=wvc.DataType.TEXT),
+    ],
+    vectorizer_config=None, # Unstructured will generate the embeddings instead of Weaviate.
+)
+
+client.close()
+```
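
To make the required-versus-recommended split above concrete, here is a small sketch that checks a list of property names against it. The function name `check_property_names` and the two constants are hypothetical and exist only for this illustration. How you obtain the list of names from an existing collection is deliberately left out.

```python
# Property names taken from the added text above: two required, two recommended.
REQUIRED = ("element_id", "record_id")
RECOMMENDED = ("text", "type")


def check_property_names(names):
    """Return (missing required names, missing recommended names)."""
    present = set(names)
    missing_required = [n for n in REQUIRED if n not in present]
    missing_recommended = [n for n in RECOMMENDED if n not in present]
    return missing_required, missing_recommended


# Example: a collection that defines only record_id and text.
missing, suggested = check_property_names(["record_id", "text"])
print("missing required:", missing)
print("missing recommended:", suggested)
```

A check like this could run before a first upload, so that a missing `element_id` or `record_id` property is caught before any data is written.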
+
+For objects in the `metadata` field that Unstructured produces and that you want to store in a Weaviate collection, be sure to follow
+Unstructured's `metadata` field naming convention. For example, if Unstructured produces a `metadata` field with the following
+child objects:
 
 ```json
-{
-    "class": "Elements",
-    "properties": [
-        {
-            "name": "record_id",
-            "dataType": ["text"]
-        }
+"metadata": {
+  "is_extracted": "true",
+  "coordinates": {
+    "points": [
+      [
+        134.20055555555555,
+        241.36027777777795
+      ],
+      [
+        134.20055555555555,
+        420.0269444444447
+      ],
+      [
+        529.7005555555555,
+        420.0269444444447
+      ],
+      [
+        529.7005555555555,
+        241.36027777777795
+      ]
+    ],
+    "system": "PixelSpace",
+    "layout_width": 1654,
+    "layout_height": 2339
+  },
+  "filetype": "application/pdf",
+  "languages": [
+    "eng"
+  ],
+  "page_number": 1,
+  "image_mime_type": "image/jpeg",
+  "filename": "realestate.pdf",
+  "data_source": {
+    "url": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf",
+    "record_locator": {
+      "protocol": "file",
+      "remote_file_path": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf"
+    }
+  },
+  "entities": {
+    "items": [
+      {
+        "entity": "HOME FOR FUTURE",
+        "type": "ORGANIZATION"
+      },
+      {
+        "entity": "221 Queen Street, Melbourne VIC 3000",
+        "type": "LOCATION"
+      }
+    ],
+    "relationships": [
+      {
+        "from": "HOME FOR FUTURE",
+        "relationship": "based_in",
+        "to": "221 Queen Street, Melbourne VIC 3000"
+      }
     ]
+  }
 }
 ```
 
-Weaviate generates any additional properties based on the incoming data.
+You could create corresponding properties in your collection's schema by using the following property names and data types:
 
-If you have specific schema requirements, you can define the schema manually.
-Unstructured cannot provide a schema that is guaranteed to work for everyone in all circumstances.
-This is because these schemas will vary based on
-your source files' types; how you want Unstructured to partition, chunk, and generate embeddings;
-any custom post-processing code that you run; and other factors.
+```python
+import os
+import weaviate
+from weaviate.classes.init import Auth
+import weaviate.classes.config as wvc
 
-You can adapt the following collection schema example for your own specific schema requirements:
+client = weaviate.connect_to_weaviate_cloud(
+    cluster_url=os.getenv("WEAVIATE_URL"),
+    auth_credentials=Auth.api_key(api_key=os.getenv("WEAVIATE_API_KEY")),
+)
 
-```json
-{
-    "class": "Elements",
-    "properties": [
-        {
-            "name": "record_id",
-            "dataType": ["text"]
-        },
-        {
-            "name": "element_id",
-            "dataType": ["text"]
-        },
-        {
-            "name": "text",
-            "dataType": ["text"]
-        },
-        {
-            "name": "embeddings",
-            "dataType": ["number[]"]
-        },
-        {
-            "name": "metadata",
-            "dataType": ["object"],
-            "nestedProperties": [
-                {
-                    "name": "parent_id",
-                    "dataType": ["text"]
-                },
-                {
-                    "name": "page_number",
-                    "dataType": ["text"]
-                },
-                {
-                    "name": "is_continuation",
-                    "dataType": ["boolean"]
-                },
-                {
-                    "name": "orig_elements",
-                    "dataType": ["text"]
-                },
-                {
-                    "name": "partitioner_type",
-                    "dataType": ["text"]
-                }
-            ]
-        }
-    ]
-}
+collection = client.collections.create(
+    name="MyCollection",
+    properties=[
+        wvc.Property(name="element_id", data_type=wvc.DataType.UUID),
+        wvc.Property(name="record_id", data_type=wvc.DataType.TEXT),
+        wvc.Property(name="text", data_type=wvc.DataType.TEXT),
+        wvc.Property(name="type", data_type=wvc.DataType.TEXT),
+        wvc.Property(
+            name="metadata",
+            data_type=wvc.DataType.OBJECT,
+            nested_properties=[
+                wvc.Property(name="is_extracted", data_type=wvc.DataType.TEXT),
+                wvc.Property(
+                    name="coordinates",
+                    data_type=wvc.DataType.OBJECT,
+                    nested_properties=[
+                        wvc.Property(name="points", data_type=wvc.DataType.TEXT),
+                        wvc.Property(name="system", data_type=wvc.DataType.TEXT),
+                        wvc.Property(name="layout_width", data_type=wvc.DataType.NUMBER),
+                        wvc.Property(name="layout_height", data_type=wvc.DataType.NUMBER),
+                    ],
+                ),
+                wvc.Property(name="filetype", data_type=wvc.DataType.TEXT),
+                wvc.Property(name="languages", data_type=wvc.DataType.TEXT_ARRAY),
+                wvc.Property(name="page_number", data_type=wvc.DataType.TEXT),
+                wvc.Property(name="image_mime_type", data_type=wvc.DataType.TEXT),
+                wvc.Property(name="filename", data_type=wvc.DataType.TEXT),
+                wvc.Property(
+                    name="data_source",
+                    data_type=wvc.DataType.OBJECT,
+                    nested_properties=[
+                        wvc.Property(name="url", data_type=wvc.DataType.TEXT),
+                        wvc.Property(name="record_locator", data_type=wvc.DataType.TEXT),
+                    ],
+                ),
+                wvc.Property(
+                    name="entities",
+                    data_type=wvc.DataType.OBJECT,
+                    nested_properties=[
+                        wvc.Property(
+                            name="items",
+                            data_type=wvc.DataType.OBJECT_ARRAY,
+                            nested_properties=[
+                                wvc.Property(name="entity", data_type=wvc.DataType.TEXT),
+                                wvc.Property(name="type", data_type=wvc.DataType.TEXT),
+                            ],
+                        ),
+                        wvc.Property(
+                            name="relationships",
+                            data_type=wvc.DataType.OBJECT_ARRAY,
+                            nested_properties=[
+                                wvc.Property(name="to", data_type=wvc.DataType.TEXT),
+                                wvc.Property(name="from", data_type=wvc.DataType.TEXT),
+                                wvc.Property(name="relationship", data_type=wvc.DataType.TEXT),
+                            ],
+                        ),
+                    ],
+                ),
+            ],
+        ),
+    ],
+    vectorizer_config=None, # Unstructured will generate the embeddings instead of Weaviate.
+)
+
+client.close()
 ```
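
When drafting nested properties like these for your own data, it may help to list the paths in a sample `metadata` object first. The following is a minimal sketch, not part of Unstructured or `weaviate-client`: the helper names `infer_type` and `outline` are hypothetical, and the type guesses are only a starting point. In particular, the sketch guesses `number` for `page_number`, whereas the example above declares it as `TEXT`, so review each guess against how the values are actually stored.

```python
from typing import Any


def infer_type(value: Any) -> str:
    """Guess a Weaviate data type name for one sample JSON value."""
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list) and value:
        if all(isinstance(v, dict) for v in value):
            return "object[]"
        if all(isinstance(v, str) for v in value):
            return "text[]"
    # Strings, nested or mixed lists, empty lists and nulls fall back to text,
    # mirroring the example above, which declares `points` as TEXT.
    return "text"


def outline(sample: dict, prefix: str = "") -> list[tuple[str, str]]:
    """List (path, guessed type) pairs, descending into nested objects."""
    rows = []
    for key, value in sample.items():
        path = f"{prefix}{key}"
        kind = infer_type(value)
        rows.append((path, kind))
        if kind == "object":
            rows.extend(outline(value, path + "."))
        elif kind == "object[]":
            # Only the first element is inspected; later elements are assumed alike.
            rows.extend(outline(value[0], path + "[]."))
    return rows


# A trimmed version of the sample metadata shown earlier.
sample = {
    "page_number": 1,
    "languages": ["eng"],
    "coordinates": {"points": [[134.2, 241.3]], "system": "PixelSpace"},
    "entities": {"items": [{"entity": "HOME FOR FUTURE", "type": "ORGANIZATION"}]},
}
for path, kind in outline(sample):
    print(f"{path}: {kind}")
```

The printed outline is meant as a checklist to compare against a hand-written schema, not as a replacement for one.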
 
-See also :
+Unstructured cannot provide a schema that is guaranteed to work in all
+circumstances. This is because these schemas will vary based on your source files' types; how you
+want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.
+
+See also:
 
 - [Collection schema](https://weaviate.io/developers/weaviate/config-refs/schema)
 - [Unstructured document elements and metadata](/api-reference/legacy-api/partition/document-elements)
