|
39 | 39 | You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again. |
40 | 40 | - If a collection name is not specified, Unstructured creates a new collection in your Weaviate cluster. The new collection's name will be `Unstructuredautocreated`. |
41 | 41 |
|
42 | | - If Unstructured creates a new collection and generates embeddings, you will not see an embeddings property in tools such as the Weaviate Cloud |
| 42 | + If Unstructured creates a new collection and generates embeddings, you will not see an `embeddings` property in tools such as the Weaviate Cloud |
43 | 43 | **Collections** user interface. To view the generated embeddings, you can run a Weaviate GraphQL query such as the following. In this query, replace `<collection-name>` with |
44 | 44 | the name of the new collection, and replace `<property-name>` with the name of each additional available property that |
45 | 45 | you want to return results for, such as `text`, `type`, `element_id`, `record_id`, and so on. The embeddings will be |
|
59 | 59 | } |
60 | 60 | ``` |
61 | 61 |
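If you assemble this kind of query in code, a small helper can fill in the collection and property names for you. The following is a minimal sketch, not part of Unstructured or the Weaviate client: `build_vector_query` is a hypothetical helper name, and it assumes the standard Weaviate GraphQL `Get` shape, in which the stored vector is requested through `_additional { vector }`.

```python
def build_vector_query(collection: str, properties: list[str]) -> str:
    """Build a Weaviate GraphQL `Get` query that also returns each object's vector.

    Hypothetical helper for illustration only. It does not validate names
    and does not send the query anywhere.
    """
    # One requested property per line, indented to sit inside the collection block.
    fields = "\n".join(f"      {name}" for name in properties)
    return (
        "{\n"
        "  Get {\n"
        f"    {collection} {{\n"
        f"{fields}\n"
        "      _additional {\n"
        "        vector\n"
        "      }\n"
        "    }\n"
        "  }\n"
        "}"
    )


if __name__ == "__main__":
    # "MyCollection" and the property names are placeholders.
    print(build_vector_query("MyCollection", ["text", "type", "element_id", "record_id"]))
```

The helper only produces the query text. You would still submit that text through whichever GraphQL client or console you already use with your Weaviate cluster.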
|
62 | | -Weaviate requires an existing collection to have a data schema before you add data. At minimum, this schema must contain the `record_id` property, as follows: |
| 62 | +If [auto-schema](https://docs.weaviate.io/weaviate/config-refs/collections#auto-schema) is enabled in Weaviate (which it is by default), |
| 63 | +Weaviate can infer missing properties and add them to the collection definition at run time. However, it is a Weaviate best practice to manually define as much |
| 64 | +of the data schema in advance as possible, since manual definition gives you the most control. |
| 65 | + |
| 66 | +The minimum viable schema for Unstructured includes only the `element_id` and `record_id` properties. The `text` and `type` properties should also be included, but they are technically optional. |
| 67 | +If you are using Unstructured to generate embeddings, you must |
| 68 | + |
| 69 | +The following code example shows how to use the [weaviate-client](https://pypi.org/project/weaviate-client/) Python package to create a
| 70 | +collection in a Weaviate Cloud database cluster with this minimum viable schema plus the recommended `text` and `type` properties, and to specify that Unstructured will generate the embeddings for this collection.
| 71 | +To connect to a locally hosted Weaviate instance instead, call [weaviate.connect_to_local](https://docs.weaviate.io/weaviate/connections/connect-local). |
| 72 | +To connect to Embedded Weaviate instead, call [weaviate.connect_to_embedded](https://docs.weaviate.io/weaviate/connections/connect-embedded). |
| 73 | + |
| 74 | +```python |
| 75 | +import os |
| 76 | +import weaviate |
| 77 | +from weaviate.classes.init import Auth |
| 78 | +import weaviate.classes.config as wvc |
| 79 | + |
| 80 | +client = weaviate.connect_to_weaviate_cloud( |
| 81 | + cluster_url=os.getenv("WEAVIATE_URL"), |
| 82 | + auth_credentials=Auth.api_key(api_key=os.getenv("WEAVIATE_API_KEY")), |
| 83 | +) |
| 84 | + |
| 85 | +collection = client.collections.create( |
| 86 | + name="MyCollection", |
| 87 | + properties=[ |
| 88 | + wvc.Property(name="element_id", data_type=wvc.DataType.UUID), |
| 89 | + wvc.Property(name="record_id", data_type=wvc.DataType.TEXT), |
| 90 | + wvc.Property(name="text", data_type=wvc.DataType.TEXT), |
| 91 | + wvc.Property(name="type", data_type=wvc.DataType.TEXT), |
| 92 | + ], |
| 93 | + vectorizer_config=None, # Unstructured will generate the embeddings instead of Weaviate. |
| 94 | +) |
| 95 | + |
| 96 | +client.close() |
| 97 | +``` |
| 98 | + |
| 99 | +For objects in the `metadata` field that Unstructured produces and that you want to store in a Weaviate collection, be sure to follow |
| 100 | +Unstructured's `metadata` field naming convention. For example, if Unstructured produces a `metadata` field with the following |
| 101 | +child objects: |
63 | 102 |
|
64 | 103 | ```json |
65 | | -{ |
66 | | - "class": "Elements", |
67 | | - "properties": [ |
68 | | - { |
69 | | - "name": "record_id", |
70 | | - "dataType": ["text"] |
71 | | - } |
| 104 | +"metadata": { |
| 105 | + "is_extracted": "true", |
| 106 | + "coordinates": { |
| 107 | + "points": [ |
| 108 | + [ |
| 109 | + 134.20055555555555, |
| 110 | + 241.36027777777795 |
| 111 | + ], |
| 112 | + [ |
| 113 | + 134.20055555555555, |
| 114 | + 420.0269444444447 |
| 115 | + ], |
| 116 | + [ |
| 117 | + 529.7005555555555, |
| 118 | + 420.0269444444447 |
| 119 | + ], |
| 120 | + [ |
| 121 | + 529.7005555555555, |
| 122 | + 241.36027777777795 |
| 123 | + ] |
| 124 | + ], |
| 125 | + "system": "PixelSpace", |
| 126 | + "layout_width": 1654, |
| 127 | + "layout_height": 2339 |
| 128 | + }, |
| 129 | + "filetype": "application/pdf", |
| 130 | + "languages": [ |
| 131 | + "eng" |
| 132 | + ], |
| 133 | + "page_number": 1, |
| 134 | + "image_mime_type": "image/jpeg", |
| 135 | + "filename": "realestate.pdf", |
| 136 | + "data_source": { |
| 137 | + "url": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf", |
| 138 | + "record_locator": { |
| 139 | + "protocol": "file", |
| 140 | + "remote_file_path": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf" |
| 141 | + } |
| 142 | + }, |
| 143 | + "entities": { |
| 144 | + "items": [ |
| 145 | + { |
| 146 | + "entity": "HOME FOR FUTURE", |
| 147 | + "type": "ORGANIZATION" |
| 148 | + }, |
| 149 | + { |
| 150 | + "entity": "221 Queen Street, Melbourne VIC 3000", |
| 151 | + "type": "LOCATION" |
| 152 | + } |
| 153 | + ], |
| 154 | + "relationships": [ |
| 155 | + { |
| 156 | + "from": "HOME FOR FUTURE", |
| 157 | + "relationship": "based_in", |
| 158 | + "to": "221 Queen Street, Melbourne VIC 3000" |
| 159 | + } |
72 | 160 | ] |
| 161 | + } |
73 | 162 | } |
74 | 163 | ``` |
75 | 164 |
|
76 | | -Weaviate generates any additional properties based on the incoming data. |
| 165 | +You could create corresponding properties in your collection's schema by using the following property names and data types: |
77 | 166 |
|
78 | | -If you have specific schema requirements, you can define the schema manually. |
79 | | -Unstructured cannot provide a schema that is guaranteed to work for everyone in all circumstances. |
80 | | -This is because these schemas will vary based on |
81 | | -your source files' types; how you want Unstructured to partition, chunk, and generate embeddings; |
82 | | -any custom post-processing code that you run; and other factors. |
| 167 | +```python |
| 168 | +import os |
| 169 | +import weaviate |
| 170 | +from weaviate.classes.init import Auth |
| 171 | +import weaviate.classes.config as wvc |
83 | 172 |
|
84 | | -You can adapt the following collection schema example for your own specific schema requirements: |
| 173 | +client = weaviate.connect_to_weaviate_cloud( |
| 174 | + cluster_url=os.getenv("WEAVIATE_URL"), |
| 175 | + auth_credentials=Auth.api_key(api_key=os.getenv("WEAVIATE_API_KEY")), |
| 176 | +) |
85 | 177 |
|
86 | | -```json |
87 | | -{ |
88 | | - "class": "Elements", |
89 | | - "properties": [ |
90 | | - { |
91 | | - "name": "record_id", |
92 | | - "dataType": ["text"] |
93 | | - }, |
94 | | - { |
95 | | - "name": "element_id", |
96 | | - "dataType": ["text"] |
97 | | - }, |
98 | | - { |
99 | | - "name": "text", |
100 | | - "dataType": ["text"] |
101 | | - }, |
102 | | - { |
103 | | - "name": "embeddings", |
104 | | - "dataType": ["number[]"] |
105 | | - }, |
106 | | - { |
107 | | - "name": "metadata", |
108 | | - "dataType": ["object"], |
109 | | - "nestedProperties": [ |
110 | | - { |
111 | | - "name": "parent_id", |
112 | | - "dataType": ["text"] |
113 | | - }, |
114 | | - { |
115 | | - "name": "page_number", |
116 | | - "dataType": ["text"] |
117 | | - }, |
118 | | - { |
119 | | - "name": "is_continuation", |
120 | | - "dataType": ["boolean"] |
121 | | - }, |
122 | | - { |
123 | | - "name": "orig_elements", |
124 | | - "dataType": ["text"] |
125 | | - }, |
126 | | - { |
127 | | - "name": "partitioner_type", |
128 | | - "dataType": ["text"] |
129 | | - } |
130 | | - ] |
131 | | - } |
132 | | - ] |
133 | | -} |
| 178 | +collection = client.collections.create( |
| 179 | + name="MyCollection", |
| 180 | + properties=[ |
| 181 | + wvc.Property(name="element_id", data_type=wvc.DataType.UUID), |
| 182 | + wvc.Property(name="record_id", data_type=wvc.DataType.TEXT), |
| 183 | + wvc.Property(name="text", data_type=wvc.DataType.TEXT), |
| 184 | + wvc.Property(name="type", data_type=wvc.DataType.TEXT), |
| 185 | + wvc.Property( |
| 186 | + name="metadata", |
| 187 | + data_type=wvc.DataType.OBJECT, |
| 188 | + nested_properties=[ |
| 189 | + wvc.Property(name="is_extracted", data_type=wvc.DataType.TEXT), |
| 190 | + wvc.Property( |
| 191 | + name="coordinates", |
| 192 | + data_type=wvc.DataType.OBJECT, |
| 193 | + nested_properties=[ |
| 194 | + wvc.Property(name="points", data_type=wvc.DataType.TEXT), |
| 195 | + wvc.Property(name="system", data_type=wvc.DataType.TEXT), |
| 196 | + wvc.Property(name="layout_width", data_type=wvc.DataType.NUMBER), |
| 197 | + wvc.Property(name="layout_height", data_type=wvc.DataType.NUMBER), |
| 198 | + ], |
| 199 | + ), |
| 200 | + wvc.Property(name="filetype", data_type=wvc.DataType.TEXT), |
| 201 | + wvc.Property(name="languages", data_type=wvc.DataType.TEXT_ARRAY), |
| 202 | + wvc.Property(name="page_number", data_type=wvc.DataType.TEXT), |
| 203 | + wvc.Property(name="image_mime_type", data_type=wvc.DataType.TEXT), |
| 204 | + wvc.Property(name="filename", data_type=wvc.DataType.TEXT), |
| 205 | + wvc.Property( |
| 206 | + name="data_source", |
| 207 | + data_type=wvc.DataType.OBJECT, |
| 208 | + nested_properties=[ |
| 209 | + wvc.Property(name="url", data_type=wvc.DataType.TEXT), |
| 210 | + wvc.Property(name="record_locator", data_type=wvc.DataType.TEXT), |
| 211 | + ], |
| 212 | + ), |
| 213 | + wvc.Property( |
| 214 | + name="entities", |
| 215 | + data_type=wvc.DataType.OBJECT, |
| 216 | + nested_properties=[ |
| 217 | + wvc.Property( |
| 218 | + name="items", |
| 219 | + data_type=wvc.DataType.OBJECT_ARRAY, |
| 220 | + nested_properties=[ |
| 221 | + wvc.Property(name="entity", data_type=wvc.DataType.TEXT), |
| 222 | + wvc.Property(name="type", data_type=wvc.DataType.TEXT), |
| 223 | + ], |
| 224 | + ), |
| 225 | + wvc.Property( |
| 226 | + name="relationships", |
| 227 | + data_type=wvc.DataType.OBJECT_ARRAY, |
| 228 | + nested_properties=[ |
| 229 | + wvc.Property(name="to", data_type=wvc.DataType.TEXT), |
| 230 | + wvc.Property(name="from", data_type=wvc.DataType.TEXT), |
| 231 | + wvc.Property(name="relationship", data_type=wvc.DataType.TEXT), |
| 232 | + ], |
| 233 | + ), |
| 234 | + ], |
| 235 | + ), |
| 236 | + ], |
| 237 | + ), |
| 238 | + ], |
| 239 | + vectorizer_config=None, # Unstructured will generate the embeddings instead of Weaviate. |
| 240 | +) |
| 241 | + |
| 242 | +client.close() |
134 | 243 | ``` |
135 | 244 |
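Before writing property definitions by hand, it can help to outline the shape of a sample `metadata` object. The following is a rough sketch under stated assumptions: `infer_type` and `outline_schema` are hypothetical helpers, not part of Unstructured or the Weaviate client. The type names they emit (`text`, `number`, `boolean`, `object`, `text[]`, `object[]`) are Weaviate data type names, and anything the sketch cannot classify, such as a list of lists, falls back to `text`. Treat the result as a starting point only. For example, it reports a numeric `page_number` as `number`, whereas the example above stores that field as text.

```python
from typing import Any


def infer_type(value: Any) -> str:
    """Guess a Weaviate data type name for one sample JSON value (heuristic)."""
    if isinstance(value, bool):  # Check bool first: bool is a subclass of int.
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list) and value:
        if all(isinstance(item, str) for item in value):
            return "text[]"
        if all(isinstance(item, dict) for item in value):
            return "object[]"
    # Strings, empty lists, lists of lists, and anything else fall back to text.
    return "text"


def outline_schema(sample: dict) -> dict:
    """Return {property_name: {"type": ..., "nested": ...}} for a sample object."""
    outline = {}
    for name, value in sample.items():
        kind = infer_type(value)
        entry = {"type": kind}
        if kind == "object":
            entry["nested"] = outline_schema(value)
        elif kind == "object[]":
            # Merge the keys of every item so that no field is missed.
            merged = {}
            for item in value:
                merged.update(item)
            entry["nested"] = outline_schema(merged)
        outline[name] = entry
    return outline


if __name__ == "__main__":
    sample = {
        "is_extracted": "true",
        "languages": ["eng"],
        "coordinates": {"points": [[134.2, 241.4]], "system": "PixelSpace"},
    }
    print(outline_schema(sample))
```

Each entry in the outline corresponds to one property definition, with `nested` entries becoming nested properties. Adjust any guessed type that does not match how you want the data stored.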
|
136 | | -See also : |
| 245 | +Unstructured cannot provide a schema that is guaranteed to work in all |
| 246 | +circumstances. This is because these schemas will vary based on your source files' types; how you |
| 247 | +want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors. |
| 248 | + |
| 249 | +See also: |
137 | 250 |
|
138 | 251 | - [Collection schema](https://weaviate.io/developers/weaviate/config-refs/schema) |
139 | 252 | - [Unstructured document elements and metadata](/api-reference/legacy-api/partition/document-elements) |