Commit 9ae9007

Updates to IBM-related connectors (#771)
1 parent e2702ea commit 9ae9007

File tree

7 files changed

+155
-47
lines changed

api-reference/workflow/destinations/ibm-watsonxdata.mdx

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,17 @@
22
title: IBM watsonx.data
33
---
44

5+
<Tip>
6+
The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance.
7+
Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity
8+
queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors
9+
instead:
10+
11+
- [Astra DB](/api-reference/workflow/destinations/astradb)
12+
- [Milvus](/api-reference/workflow/destinations/milvus) on IBM watsonx.data
13+
14+
</Tip>
15+
516
import FirstTimeAPIDestinationConnector from '/snippets/general-shared-text/first-time-api-destination-connector.mdx';
617

718
<FirstTimeAPIDestinationConnector />

open-source/ingestion/destination-connectors/ibm-watsonxdata.mdx

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,17 @@
22
title: IBM watsonx.data
33
---
44

5+
<Tip>
6+
The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance.
7+
Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity
8+
queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors
9+
instead:
10+
11+
- [Astra DB](/open-source/ingestion/destination-connectors/astradb)
12+
- [Milvus](/open-source/ingestion/destination-connectors/milvus) on IBM watsonx.data
13+
14+
</Tip>
15+
516
import SharedIBMWatsonxdata from '/snippets/dc-shared-text/ibm-watsonxdata-cli-api.mdx';
617

718
<SharedIBMWatsonxdata />
Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
- `<name>` (_required_) - A unique name for this connector.
22
- `<token>` (_required_) - The application token for the database.
3-
- `<api-endpoint>` (_required_) - The databases associated API endpoint.
4-
- `<collection-name>` - The name of the collection in the namespace. If no value is provided, see the beginning of this article for the behavior at run time.
3+
- `<api-endpoint>` (_required_) - The database's associated API endpoint.
4+
- `<collection-name>` - The name of the collection in the keyspace. If no value is provided, see the beginning of this article for the behavior at run time.
55
- `<keyspace>` - The name of the keyspace in the collection. The default is `default_keyspace` if not otherwise specified.
66
- `<batch-size>` - The maximum number of records to send per batch. The default is `20` if not otherwise specified.
77
- `flatten_metadata` - Set to `true` to flatten the metadata into each record. Specifically, when flattened, the metadata key values are brought to the top level of the element, and the `metadata` key itself is removed. By default, the metadata is not flattened (`false`).
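
As a rough illustration of what `flatten_metadata` describes, the following Python sketch lifts the metadata keys to the top level of an element and drops the `metadata` key. The function name `flatten_element` and the sample element are hypothetical, this is not Unstructured's implementation, and the handling of key collisions (an existing top-level key is kept) is an assumption made only for this sketch.

```python
from typing import Any


def flatten_element(element: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of element with its metadata keys lifted to the top level."""
    # Copy every top-level key except `metadata` itself.
    flattened = {k: v for k, v in element.items() if k != "metadata"}
    for key, value in element.get("metadata", {}).items():
        # Assumption: on a name collision, the existing top-level key is kept.
        flattened.setdefault(key, value)
    return flattened


# Hypothetical sample element, for illustration only.
element = {
    "element_id": "abc123",
    "text": "Hello",
    "metadata": {"filename": "a.pdf", "page_number": 1},
}
flat = flatten_element(element)
```

After the call, `flat` has `filename` and `page_number` alongside `element_id` and `text`, and no `metadata` key.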

snippets/general-shared-text/astradb-platform.mdx

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
Fill in the following fields:
22

33
- **Name** (_required_): A unique name for this connector.
4-
- **Collection Name**: The name of the collection in the namespace. If no value is provided, see the beginning of this article for the behavior at run time.
4+
- **Collection Name**: The name of the collection in the keyspace. If no value is provided, see the beginning of this article for the behavior at run time.
55
- **Keyspace** (_required_): The name of the keyspace in the collection.
66
- **Batch Size**: The maximum number of records per batch. The default is `20` if not otherwise specified.
77
- **Flatten Metadata**: Check this box to flatten the metadata into each record.

snippets/general-shared-text/astradb.mdx

Lines changed: 60 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -8,26 +8,70 @@ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; pic
88
allowfullscreen
99
></iframe>
1010

11-
- An Astra account. [Create or sign in to an Astra account](https://astra.datastax.com/).
12-
- A database in the Astra account. [Create a database in an account](https://docs.datastax.com/en/astra-db-classic/databases/manage-create.html).
13-
- An application token for the database. [Create a database application token](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html).
14-
- A namespace in the database. [Create a namespace in a database](https://docs.datastax.com/en/astra-db-serverless/databases/manage-namespaces.html#create-namespace).
15-
- A collection in the namespace. [Create a collection in a namespace](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection).
11+
- An IBM Cloud account or DataStax account.
1612

17-
An existing collection is not required. At runtime, the collection behavior is as follows:
13+
- For an IBM Cloud account, [sign up](https://cloud.ibm.com/registration) for an IBMid, and then [sign in](https://accounts.datastax.com/session-service/v1/login) to DataStax with your IBMid.
14+
- For a DataStax account, [sign up](https://astra.datastax.com/signup) for a DataStax account, and then [sign in](https://accounts.datastax.com/session-service/v1/login) to DataStax with your DataStax account.
15+
16+
- An Astra DB database in the DataStax account. To create a database:
17+
18+
a. After you sign in to DataStax, click **Create database**.<br/>
19+
b. Click the **Serverless (vector)** tile, if it is not already selected.<br/>
20+
c. For **Database name**, enter some unique name for the database.<br/>
21+
d. Select a **Provider** and a **Region**, and then click **Create database**.<br/>
22+
23+
[Learn more](https://docs.datastax.com/en/astra-db-classic/databases/manage-create.html).
24+
25+
- An application token for the database. To create an application token:
26+
27+
a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
28+
b. On the **Overview** tab, under **Database Details**, in the **Application Tokens** tile, click **Generate Token**.<br/>
29+
c. Enter some **Token description** and select an **Expiration** time period, and then click **Generate token**.<br/>
30+
d. Save the application token that is displayed to a secure location, and then click **Close**.<br/>
31+
32+
[Learn more](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html).
33+
34+
- A keyspace in the database. To create a keyspace:
35+
36+
a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
37+
b. On the **Data Explorer** tab, in the **Keyspace** list, select **Create keyspace**.<br/>
38+
c. Enter some **Keyspace name**, and then click **Add keyspace**.<br/>
39+
40+
[Learn more](https://docs.datastax.com/en/astra-db-serverless/databases/manage-keyspaces.html#keyspaces).
41+
42+
- A collection in the keyspace.
1843

1944
For the [Unstructured UI](/ui/overview) and [Unstructured API](/api-reference/overview):
2045

21-
- If an existing collection name is specified, and Unstructured generates embeddings,
22-
but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
23-
You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
24-
- If a collection name is not specified, Unstructured creates a new collection in your namespace. If Unstructured generates embeddings,
25-
the new collections's name will be `u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>`.
26-
If Unstructured does not generate embeddings, the new collections's name will be `u<short-workflow-id`.
46+
- An existing collection is not required. At runtime, the collection behavior is as follows:
47+
48+
- If an existing collection name is specified, and Unstructured generates embeddings,
49+
but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
50+
You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
51+
- If a collection name is not specified, Unstructured creates a new collection in your keyspace. If Unstructured generates embeddings,
52+
the new collection's name will be `u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>`.
53+
If Unstructured does not generate embeddings, the new collection's name will be `u<short-workflow-id>`.
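
The run-time behavior described in the two bullets above can be sketched in Python as follows. This is a minimal sketch, not Unstructured's code: the function names are hypothetical, and how Unstructured derives the short workflow ID and short embedding model name is not stated in this article, so the values passed in are placeholders.

```python
from typing import Optional


def auto_collection_name(
    short_workflow_id: str,
    short_model_name: Optional[str] = None,
    dimensions: Optional[int] = None,
) -> str:
    """Build a name that follows the auto-created collection naming pattern."""
    if short_model_name is None or dimensions is None:
        # No embeddings are generated: u<short-workflow-id>
        return f"u{short_workflow_id}"
    # Embeddings are generated:
    # u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>
    return f"u{short_workflow_id}_{short_model_name}_{dimensions}"


def check_dimensions(collection_dimensions: int, embedding_dimensions: int) -> None:
    """Raise if the generated embeddings would not match an existing collection."""
    if collection_dimensions != embedding_dimensions:
        raise ValueError(
            f"Collection expects {collection_dimensions} dimensions, "
            f"but the embedding model generates {embedding_dimensions}."
        )


# Placeholder values, for illustration only.
with_embeddings = auto_collection_name("1a2b3c", "minilm", 384)
without_embeddings = auto_collection_name("1a2b3c")
```

Calling `check_dimensions(1536, 384)` raises a `ValueError`, which mirrors the failed run that the first bullet describes.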
2754

2855
For [Unstructured Ingest](/open-source/ingestion/overview):
2956

30-
- If an existing collection name is specified, and Unstructured generates embeddings,
31-
but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
32-
You must change your Unstructured embedding settings or your existing collections's embedding settings to match, and try the run again.
33-
- If a collection name is not specified, Unstructured creates a new collection in your Pinecone account. The new collection's name will be `unstructuredautocreated`.
57+
- For the source connector only, an existing collection is required.
58+
- For the destination connector only, an existing collection is not required. At runtime, the collection behavior is as follows:
59+
60+
- If an existing collection name is specified, and Unstructured generates embeddings,
61+
but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail.
62+
You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
63+
- If a collection name is not specified, Unstructured creates a new collection in your keyspace. The new collection's name will be `unstructuredautocreated`.
64+
65+
To create a collection yourself:
66+
67+
a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
68+
b. On the **Data Explorer** tab, in the **Keyspace** list, select the name of the target keyspace.<br/>
69+
c. In the **Collections** list, select **Create collection**.<br/>
70+
d. Enter some **Collection name**.<br/>
71+
e. Turn on **Vector-enabled collection**, if it is not already turned on.<br/>
72+
f. For **Embedding generation method**, select **Bring my own**.<br/>
73+
g. For **Dimensions**, enter the number of dimensions for the embedding model that you plan to use.<br/>
74+
h. For **Similarity metric**, select **Cosine**.<br/>
75+
i. Click **Create collection**.<br/>
76+
77+
[Learn more](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection).

snippets/general-shared-text/milvus.mdx

Lines changed: 59 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,61 @@
1-
- For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), only Milvus cloud-based instances (such as Zilliz Cloud, and Milvus on IBM watsonx.data) are supported.
1+
- For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), only Milvus cloud-based instances (such as Milvus on IBM watsonx.data, or Zilliz Cloud) are supported.
22
- For [Unstructured Ingest](/open-source/ingestion/overview), Milvus local and cloud-based instances are supported.
33

4-
The following video shows how to fulfill the minimum set of requirements for Milvus cloud-based instances, demonstrating Milvus on IBM watsonx.data:
4+
- For Milvus on IBM watsonx.data, you will need:
5+
6+
<iframe
7+
width="560"
8+
height="315"
9+
src="https://www.youtube.com/embed/hLCwoe2fCnc"
10+
title="YouTube video player"
11+
frameborder="0"
12+
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
13+
allowfullscreen
14+
></iframe>
15+
16+
- An [IBM Cloud account](https://cloud.ibm.com/registration).
17+
- An IBM watsonx.data [Lite plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-tutorial_prov_lite_1)
18+
or [Enterprise plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started_1) within your IBM Cloud account.
19+
20+
- If you are provisioning a Lite plan, be sure to choose the **Generative AI** use case when prompted, as this is the only use case offered that includes Milvus.
21+
22+
- A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service).
23+
24+
- If you are creating a Milvus service instance within a watsonx.data Lite plan, when you are prompted to choose a Milvus instance size, you can only select **Lite**. Because the Lite
25+
Milvus instance size is recommended only for 384 dimensions, you should also use an embedding model that uses 384 dimensions only.
26+
- If you are creating a Milvus service instance within a watsonx.data Enterprise plan, you can choose any available Milvus instance size. However, all Milvus instance sizes other than
27+
**Custom** are recommended only for 384 dimensions, which means you should use an embedding model that uses 384 dimensions only.
28+
The **Custom** Milvus instance size is recommended for any number of dimensions.
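
The sizing guidance above reduces to a simple rule, sketched here in Python. The function name `is_recommended` is hypothetical and the size labels are passed as plain strings; only **Lite** and **Custom** are named in this article, so any other size label used with this function is an assumption.

```python
def is_recommended(instance_size: str, embedding_dimensions: int) -> bool:
    """Apply the guidance above: only Custom is recommended beyond 384 dimensions."""
    if instance_size.strip().lower() == "custom":
        # The Custom size is recommended for any number of dimensions.
        return True
    # Every other size, including Lite, is recommended only for 384 dimensions.
    return embedding_dimensions == 384


lite_384 = is_recommended("Lite", 384)
lite_1536 = is_recommended("Lite", 1536)
custom_1536 = is_recommended("Custom", 1536)
```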
29+
30+
- The URI of the instance, which consists of `https://`, followed by the instance's **GRPC host**, followed by a colon and the **GRPC port**.
31+
That is, the URI takes the format `https://<host>:<port>`. To get this information, do the following:
32+
33+
a. Sign in to your IBM Cloud account.<br/>
34+
b. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the title bar.<br/>
35+
c. Expand **Databases**, and then click the name of the target **watsonx.data** plan.<br/>
36+
d. Click **Open web console**.<br/>
37+
e. On the sidebar, click **Infrastructure manager**. If the sidebar is not visible, click the **Global navigation** icon to the far left of the title bar.<br/>
38+
f. Click the target Milvus service instance.<br/>
39+
g. On the **Details** tab, under **Type**, click **View connect details**.<br/>
40+
h. Under **Service details**, expand **GRPC**, and note the value of **GRPC host** and **GRPC port**.<br/>
41+
42+
- The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
43+
- The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section.
44+
- The username and password to access the instance.
45+
46+
- The username for Milvus on IBM watsonx.data is always `ibmlhapikey`.
47+
- The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key. To create an IBM Cloud user API key:
48+
49+
a. Sign in to your IBM Cloud account.<br/>
50+
b. In the title bar, click **Manage** and then, under **Security and access**, click **Access (IAM)**.<br/>
51+
c. On the sidebar, under **Manage identities**, click **API keys**. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the title bar.<br/>
52+
d. Click **Create**.<br/>
53+
e. Enter some **Name** for the API key.<br/>
54+
f. Optionally, enter some **Description** for the API key.<br/>
55+
g. For **Leaked action**, leave **Disable the leaked key** selected.<br/>
56+
h. For **Session management**, leave **No** selected.<br/>
57+
i. Click **Create**.<br/>
58+
j. Click **Download** (or **Copy**), and then download the API key to a secure location (or paste the copied API key into a secure location). You won't be able to access this API key from this dialog again. If you lose this API key, you can create a new one (and you should then delete the old one).<br/>
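
To show how the pieces gathered above fit together, here is a minimal Python sketch that assembles the connection settings for Milvus on IBM watsonx.data. The function name `milvus_connection_args`, the host, the port, and the `IBM_CLOUD_API_KEY` environment variable name are all hypothetical placeholders. The dictionary keys are chosen to line up with the `uri`, `user`, `password`, and `db_name` arguments of `MilvusClient` in the Python SDK for Milvus, but treat that mapping as an assumption and confirm it against your installed `pymilvus` version.

```python
import os

# Per the text above, the username for Milvus on IBM watsonx.data is always this value.
WATSONX_MILVUS_USERNAME = "ibmlhapikey"


def milvus_connection_args(
    grpc_host: str, grpc_port: int, api_key: str, db_name: str
) -> dict:
    """Assemble connection settings from the GRPC host, GRPC port, and API key."""
    return {
        # URI format: https://<host>:<port>
        "uri": f"https://{grpc_host}:{grpc_port}",
        "user": WATSONX_MILVUS_USERNAME,
        # The password is an IBM Cloud user API key.
        "password": api_key,
        "db_name": db_name,
    }


args = milvus_connection_args(
    grpc_host="example-host.example.com",  # hypothetical GRPC host
    grpc_port=30439,  # hypothetical GRPC port
    api_key=os.environ.get("IBM_CLOUD_API_KEY", "<your-api-key>"),
    db_name="default",
)
# With pymilvus installed, these settings could be passed as MilvusClient(**args).
```

Reading the API key from an environment variable keeps it out of source files, which fits the advice above to store the key in a secure location.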
559

660
- For Zilliz Cloud, you will need:
761

@@ -54,31 +108,6 @@ The following video shows how to fulfill the minimum set of requirements for Mil
54108
The number of dimensions for the `embeddings` field must match the number of dimensions for the embedding model that you plan to use.
55109
</Warning>
56110

57-
- For Milvus on IBM watsonx.data, you will need:
58-
59-
<iframe
60-
width="560"
61-
height="315"
62-
src="https://www.youtube.com/embed/hLCwoe2fCnc"
63-
title="YouTube video player"
64-
frameborder="0"
65-
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
66-
allowfullscreen
67-
></iframe>
68-
69-
- An [IBM Cloud account](https://cloud.ibm.com/registration).
70-
- The [IBM watsonx.data subscription plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started).
71-
- A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service).
72-
- The URI of the instance, which takes the format of `https://`, followed by instance's **GRPC host**, followed by a colon and the **GRPC port**.
73-
This takes the format of `https://<host>:<port>`.
74-
[Get the instance's GRPC host and GRPC port](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-conn-to-milvus).
75-
- The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
76-
- The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section.
77-
- The username and password to access the instance.
78-
The username for Milvus on IBM watsonx.data is always `ibmlhapikey`.
79-
The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key.
80-
[Get the user API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui).
81-
82111
- For Milvus local, you will need:
83112

84113
- A [Milvus instance](https://milvus.io/docs/install-overview.md).
@@ -89,7 +118,9 @@ The following video shows how to fulfill the minimum set of requirements for Mil
89118
- The [username and password, or token](https://milvus.io/docs/authenticate.md) to access the instance.
90119

91120
All Milvus instances require the target collection to have a defined schema before Unstructured can write to the collection. The minimum viable
92-
schema for Unstructured contains only the fields `element_id`, `embeddings`, `record_id`, and `text`, as follows. This example code demonstrates the use of the
121+
schema for Unstructured contains only the fields `element_id`, `embeddings`, `record_id`, and `text`, as follows.
122+
123+
This example code demonstrates the use of the
93124
[Python SDK for Milvus](https://pypi.org/project/pymilvus/) to create a collection with this schema,
94125
targeting Milvus on IBM watsonx.data. For the `MilvusClient` arguments to connect to other types of Milvus deployments, see your Milvus provider's documentation:
95126

ui/destinations/ibm-watsonxdata.mdx

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,17 @@
22
title: IBM watsonx.data
33
---
44

5+
<Tip>
6+
The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance.
7+
Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity
8+
queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors
9+
instead:
10+
11+
- [Astra DB](/ui/destinations/astradb)
12+
- [Milvus](/ui/destinations/milvus) on IBM watsonx.data
13+
14+
</Tip>
15+
516
import FirstTimeUIDestinationConnector from '/snippets/general-shared-text/first-time-ui-destination-connector.mdx';
617

718
<FirstTimeUIDestinationConnector />
