Commit b43a104

Milvus destination connector: Add docs and links for Zilliz Cloud, and Milvus on IBM Cloud watsonx.data (#375)
1 parent 4498354 commit b43a104

1 file changed (+121, -80 lines changed)

@@ -1,4 +1,7 @@
1-
The following video shows how to fulfill the minimum set of Milvus requirements, demonstrating Milvus on IBM watsonx.data:
1+
- For the [Unstructured Platform](/platform/overview), only Milvus cloud-based instances (such as Zilliz Cloud and Milvus on IBM watsonx.data) are supported.
2+
- For [Unstructured Ingest](/ingestion/overview), Milvus local and cloud-based instances are supported.
3+
4+
The following video shows how to fulfill the minimum set of requirements for Milvus cloud-based instances, demonstrating Milvus on IBM watsonx.data:
25

36
<iframe
47
width="560"
@@ -10,83 +13,121 @@ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; pic
1013
allowfullscreen
1114
></iframe>
1215

13-
- A [Milvus instance](https://milvus.io/docs/install-overview.md).
14-
- The [URI](https://milvus.io/api-reference/pymilvus/v2.4.x/MilvusClient/Client/MilvusClient.md) of the instance.
15-
- The [username and password, or token](https://milvus.io/docs/authenticate.md) to access the instance.
16-
- The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
17-
- The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database.
18-
19-
Milvus requires the target collection to have a defined schema before Unstructured can write to the collection. The minimum viable
20-
schema for Unstructured contains only the fields `element_id`, `embeddings`, and `record_id`, as follows. This example code demonstrates the use of the
21-
[Python SDK for Milvus](https://pypi.org/project/pymilvus/) to create a collection with this minimum viable schema,
22-
targeting Milvus on IBM watsonx.data:
23-
24-
```python Python
25-
import os
26-
from pymilvus import (
27-
connections,
28-
FieldSchema,
29-
DataType,
30-
CollectionSchema,
31-
Collection,
32-
)
33-
34-
connections.connect(
35-
alias="default",
36-
host=os.getenv("MILVUS_GRPC_HOST"),
37-
port=os.getenv("MILVUS_GRPC_PORT"),
38-
user=os.getenv("MILVUS_USER"),
39-
password=os.getenv("MILVUS_PASSWORD"),
40-
secure=True
41-
)
42-
43-
primary_key = FieldSchema(
44-
name="element_id",
45-
dtype=DataType.VARCHAR,
46-
is_primary=True,
47-
max_length=200
48-
)
49-
50-
vector = FieldSchema(
51-
name="embeddings",
52-
dtype=DataType.FLOAT_VECTOR,
53-
dim=3072
54-
)
55-
56-
record_id = FieldSchema(
57-
name="record_id",
58-
dtype=DataType.VARCHAR,
59-
max_length=200
60-
)
61-
62-
schema = CollectionSchema(
63-
fields=[primary_key, vector, record_id],
64-
enable_dynamic_field=True
65-
)
66-
67-
collection = Collection(
68-
name="my_collection",
69-
schema=schema,
70-
using="default"
71-
)
72-
73-
index_params = {
74-
"metric_type": "L2",
75-
"index_type": "IVF_FLAT",
76-
"params": {"nlist": 1024}
77-
}
78-
79-
collection.create_index(
80-
field_name="embeddings",
81-
index_params=index_params
82-
)
83-
```
84-
85-
Other approaches, such as [creating collections instantly](https://milvus.io/docs/create-collection-instantly.md) or
86-
[setting nullable and default fields](https://milvus.io/docs/nullable-and-default.md), have not
87-
been fully evaluated by Unstructured and might produce unexpected results.
88-
89-
Unstructured cannot provide a schema that is guaranteed to work in all
90-
circumstances. This is because these schemas will vary based on your source files' types; how you
91-
want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.
16+
- For Zilliz Cloud, you will need:
17+
18+
- A [Zilliz Cloud account](https://cloud.zilliz.com/signup).
19+
- A [Zilliz Cloud cluster](https://docs.zilliz.com/docs/create-cluster).
20+
- The URI of the cluster, also known as the cluster's _public endpoint_, which takes a format such as
21+
`https://<cluster-id>.<cluster-type>.<cloud-provider>-<region>.cloud.zilliz.com`.
22+
[Get the cluster's public endpoint](https://docs.zilliz.com/docs/manage-cluster#connect-to-cluster).
23+
- The token to access the cluster. [Get the cluster's token](https://docs.zilliz.com/docs/manage-cluster#connect-to-cluster).
24+
- The name of the [database](https://docs.zilliz.com/docs/database#create-database) in the instance.
25+
- The name of the [collection](https://docs.zilliz.com/docs/manage-collections-console#create-collection) in the database.
26+
27+
The collection must have a defined schema before Unstructured can write to the collection. The minimum viable
28+
schema for Unstructured contains only the fields `element_id`, `embeddings`, and `record_id`, as follows:
29+
30+
| Field Name | Field Type | Max Length | Dimension | Index | Metric Type |
31+
|---|---|---|---|---|--|
32+
| `element_id` (primary key field) | **VARCHAR** | `200` | -- | -- | -- |
33+
| `embeddings` (vector field) | **FLOAT_VECTOR** | -- | `3072` | Yes (Checked) | **Cosine** |
34+
| `record_id` | **VARCHAR** | `200` | -- | -- | -- |
35+
36+
- For Milvus on IBM watsonx.data, you will need:
37+
38+
- An [IBM Cloud account](https://cloud.ibm.com/registration).
39+
- The [IBM watsonx.data subscription plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started).
40+
- A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service).
41+
- The URI of the instance, which consists of `https://`, followed by the instance's **GRPC host**, followed by a colon and the **GRPC port**.
42+
This takes the format of `https://<host>:<port>`.
43+
[Get the instance's GRPC host and GRPC port](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-conn-to-milvus).
44+
- The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
45+
- The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section.
46+
- The username and password to access the instance.
47+
The username for Milvus on IBM watsonx.data is always `ibmlhapikey`.
48+
The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key.
49+
[Get the user API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui).
50+
51+
- For Milvus local, you will need:
52+
53+
- A [Milvus instance](https://milvus.io/docs/install-overview.md).
54+
- The [URI](https://milvus.io/api-reference/pymilvus/v2.4.x/MilvusClient/Client/MilvusClient.md) of the instance.
55+
- The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
56+
- The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database.
57+
Note the collection requirements at the end of this section.
58+
- The [username and password, or token](https://milvus.io/docs/authenticate.md) to access the instance.
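
The URI rule for Milvus on IBM watsonx.data described above (`https://`, then the GRPC host, a colon, and the GRPC port) can be sketched as a small helper. This is only an illustration: the function names and the returned dictionary keys are hypothetical and are not part of Unstructured or `pymilvus`. The environment variable names are the ones used in the example code in this section.

```python
import os


def watsonx_milvus_uri(host, port):
    """Join a GRPC host and GRPC port into the https://<host>:<port> form."""
    host = host.strip()
    # Tolerate a host value that was pasted with a scheme already attached.
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    # int() rejects a port that is not a whole number.
    return f"https://{host.rstrip('/')}:{int(port)}"


def watsonx_milvus_settings(env=None):
    """Collect the values listed above (hypothetical helper and key names)."""
    env = os.environ if env is None else env
    return {
        "uri": watsonx_milvus_uri(env["MILVUS_GRPC_HOST"], env["MILVUS_GRPC_PORT"]),
        # The username for Milvus on IBM watsonx.data is always ibmlhapikey.
        "user": "ibmlhapikey",
        # The password is an IBM Cloud user API key.
        "password": env["MILVUS_PASSWORD"],
    }
```

Reading the host, port, and API key from environment variables keeps the credentials out of source code, which matches how the example code in this section connects.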
59+
60+
All Milvus instances require the target collection to have a defined schema before Unstructured can write to the collection. The minimum viable
61+
schema for Unstructured contains only the fields `element_id`, `embeddings`, and `record_id`, as follows. This example code demonstrates the use of the
62+
[Python SDK for Milvus](https://pypi.org/project/pymilvus/) to create a collection with this minimum viable schema,
63+
targeting Milvus on IBM watsonx.data. For the `connections.connect` arguments to connect to other types of Milvus deployments, see your Milvus provider's documentation:
64+
65+
```python Python
66+
import os
67+
from pymilvus import (
68+
connections,
69+
FieldSchema,
70+
DataType,
71+
CollectionSchema,
72+
Collection,
73+
)
74+
75+
connections.connect(
76+
alias="default",
77+
host=os.getenv("MILVUS_GRPC_HOST"),
78+
port=os.getenv("MILVUS_GRPC_PORT"),
79+
user=os.getenv("MILVUS_USER"),
80+
password=os.getenv("MILVUS_PASSWORD"),
81+
secure=True
82+
)
83+
84+
primary_key = FieldSchema(
85+
name="element_id",
86+
dtype=DataType.VARCHAR,
87+
is_primary=True,
88+
max_length=200
89+
)
90+
91+
vector = FieldSchema(
92+
name="embeddings",
93+
dtype=DataType.FLOAT_VECTOR,
94+
dim=3072
95+
)
96+
97+
record_id = FieldSchema(
98+
name="record_id",
99+
dtype=DataType.VARCHAR,
100+
max_length=200
101+
)
102+
103+
schema = CollectionSchema(
104+
fields=[primary_key, vector, record_id],
105+
enable_dynamic_field=True
106+
)
107+
108+
collection = Collection(
109+
name="my_collection",
110+
schema=schema,
111+
using="default"
112+
)
113+
114+
index_params = {
115+
"metric_type": "L2",
116+
"index_type": "IVF_FLAT",
117+
"params": {"nlist": 1024}
118+
}
119+
120+
collection.create_index(
121+
field_name="embeddings",
122+
index_params=index_params
123+
)
124+
```
125+
126+
Other approaches, such as [creating collections instantly](https://milvus.io/docs/create-collection-instantly.md) or
127+
[setting nullable and default fields](https://milvus.io/docs/nullable-and-default.md), have not
128+
been fully evaluated by Unstructured and might produce unexpected results.
129+
130+
Unstructured cannot provide a schema that is guaranteed to work in all
131+
circumstances. This is because these schemas will vary based on your source files' types; how you
132+
want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.
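
Because the right schema depends on these factors, it can help to check a few sample records against the collection's fields before writing. The following is a minimal sketch that assumes the minimum viable schema above (`element_id` and `record_id` as `VARCHAR` with a maximum length of 200, and `embeddings` with 3072 dimensions). `check_record` is a hypothetical helper, not part of Unstructured or `pymilvus`, and it counts UTF-8 bytes for the length limit as the more conservative reading.

```python
EMBEDDING_DIM = 3072    # dim of the embeddings field in the example schema
MAX_VARCHAR_LEN = 200   # max_length of element_id and record_id


def check_record(record, dim=EMBEDDING_DIM):
    """Return a list of problems found in one record; empty means it fits."""
    problems = []
    for name in ("element_id", "record_id"):
        value = record.get(name)
        if not isinstance(value, str):
            problems.append(f"{name} must be a string")
        elif len(value.encode("utf-8")) > MAX_VARCHAR_LEN:
            problems.append(f"{name} is longer than {MAX_VARCHAR_LEN} bytes")
    vector = record.get("embeddings")
    if not isinstance(vector, (list, tuple)) or len(vector) != dim:
        problems.append(f"embeddings must have exactly {dim} values")
    return problems
```

If your embedding model produces vectors of a different size, pass that size as `dim` and create the collection's `embeddings` field with the same dimension.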
92133
