Updates to IBM-related connectors (#771)

Paul-Cornell · web-flow · commit 9ae9007f24cd · 2025-10-28T11:40:00.000-07:00
diff --git a/api-reference/workflow/destinations/ibm-watsonxdata.mdx b/api-reference/workflow/destinations/ibm-watsonxdata.mdx
@@ -2,6 +2,17 @@
 title: IBM watsonx.data
 ---
 
+<Tip>
+    The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance. 
+    Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity 
+    queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors 
+    instead:
+
+    - [Astra DB](/api-reference/workflow/destinations/astradb)
+    - [Milvus](/api-reference/workflow/destinations/milvus) on IBM watsonx.data
+
+</Tip>
+
 import FirstTimeAPIDestinationConnector from '/snippets/general-shared-text/first-time-api-destination-connector.mdx';
 
 <FirstTimeAPIDestinationConnector />
diff --git a/open-source/ingestion/destination-connectors/ibm-watsonxdata.mdx b/open-source/ingestion/destination-connectors/ibm-watsonxdata.mdx
@@ -2,6 +2,17 @@
 title: IBM watsonx.data
 ---
 
+<Tip>
+    The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance. 
+    Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity 
+    queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors 
+    instead:
+
+    - [Astra DB](/open-source/ingestion/destination-connectors/astradb)
+    - [Milvus](/open-source/ingestion/destination-connectors/milvus) on IBM watsonx.data
+
+</Tip>
+
 import SharedIBMWatsonxdata from '/snippets/dc-shared-text/ibm-watsonxdata-cli-api.mdx';
 
 <SharedIBMWatsonxdata />
diff --git a/snippets/general-shared-text/astradb-api-placeholders.mdx b/snippets/general-shared-text/astradb-api-placeholders.mdx
@@ -1,7 +1,7 @@
 - `<name>` (_required_) - A unique name for this connector.
 - `<token>` (_required_) - The application token for the database.
-- `<api-endpoint>` (_required_) - The database’s associated API endpoint.
-- `<collection-name>` - The name of the collection in the namespace. If no value is provided, see the beginning of this article for the behavior at run time.
+- `<api-endpoint>` (_required_) - The database's associated API endpoint.
+- `<collection-name>` - The name of the collection in the keyspace. If no value is provided, see the beginning of this article for the behavior at run time.
 - `<keyspace>` - The name of the keyspace in the collection. The default is `default_keyspace` if not otherwise specified.
 - `<batch-size>` - The maximum number of records to send per batch. The default is `20` if not otherwise specified.
 - `flatten_metadata` - Set to `true` to flatten the metadata into each record. Specifically, when flattened, the metadata key values are brought to the top level of the element, and the `metadata` key itself is removed. By default, the metadata is not flattened (`false`).
diff --git a/snippets/general-shared-text/astradb-platform.mdx b/snippets/general-shared-text/astradb-platform.mdx
@@ -1,7 +1,7 @@
 Fill in the following fields:
 
 - **Name** (_required_): A unique name for this connector.
-- **Collection Name**: The name of the collection in the namespace. If no value is provided, see the beginning of this article for the behavior at run time.
+- **Collection Name**: The name of the collection in the keyspace. If no value is provided, see the beginning of this article for the behavior at run time.
 - **Keyspace** (_required_): The name of the keyspace in the collection.
 - **Batch Size**: The maximum number of records per batch. The default is `20` if not otherwise specified.
 - **Flatten Metadata**: Check this box to flatten the metadata into each record. 
diff --git a/snippets/general-shared-text/astradb.mdx b/snippets/general-shared-text/astradb.mdx
@@ -8,26 +8,70 @@ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; pic
 allowfullscreen
 ></iframe>
 
-- An Astra account. [Create or sign in to an Astra account](https://astra.datastax.com/).
-- A database in the Astra account. [Create a database in an account](https://docs.datastax.com/en/astra-db-classic/databases/manage-create.html).
-- An application token for the database. [Create a database application token](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html).
-- A namespace in the database. [Create a namespace in a database](https://docs.datastax.com/en/astra-db-serverless/databases/manage-namespaces.html#create-namespace).
-- A collection in the namespace. [Create a collection in a namespace](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection).
+- An IBM Cloud account or DataStax account.
 
-  An existing collection is not required. At runtime, the collection behavior is as follows:
+  - For an IBM Cloud account, [sign up](https://cloud.ibm.com/registration) for an IBMid, and then [sign in](https://accounts.datastax.com/session-service/v1/login) to DataStax with your IBMid.
+  - For a DataStax account, [sign up](https://astra.datastax.com/signup) for a DataStax account, and then [sign in](https://accounts.datastax.com/session-service/v1/login) to DataStax with your DataStax account.
+
+- An Astra DB database in the DataStax account. To create a database:
+
+  a. After you sign in to DataStax, click **Create database**.<br/>
+  b. Click the **Serverless (vector)** tile, if it is not already selected.<br/>
+  c. For **Database name**, enter some unique name for the database.<br/>
+  d. Select a **Provider** and a **Region**, and then click **Create database**.<br/>
+  
+  [Learn more](https://docs.datastax.com/en/astra-db-classic/databases/manage-create.html).
+
+- An application token for the database. To create an application token:
+
+  a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
+  b. On the **Overview** tab, under **Database Details**, in the **Application Tokens** tile, click **Generate Token**.<br/>
+  c. Enter some **Token description** and select an **Expiration** time period, and then click **Generate token**.<br/>
+  d. Save the application token that is displayed to a secure location, and then click **Close**.<br/>
+
+  [Learn more](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html).
+
+- A keyspace in the database. To create a keyspace:
+
+  a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
+  b. On the **Data Explorer** tab, in the **Keyspace** list, select **Create keyspace**.<br/>
+  c. Enter some **Keyspace name**, and then click **Add keyspace**.<br/>
+
+  [Learn more](https://docs.datastax.com/en/astra-db-serverless/databases/manage-keyspaces.html#keyspaces).
+
+- A collection in the keyspace.
 
   For the [Unstructured UI](/ui/overview) and [Unstructured API](/api-reference/overview):
 
-  - If an existing collection name is specified, and Unstructured generates embeddings, 
-    but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail. 
-    You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
-  - If a collection name is not specified, Unstructured creates a new collection in your namespace. If Unstructured generates embeddings, 
-    the new collections's name will be `u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>`. 
-    If Unstructured does not generate embeddings, the new collections's name will be `u<short-workflow-id`.
+  - An existing collection is not required. At runtime, the collection behavior is as follows:
+  
+    - If an existing collection name is specified, and Unstructured generates embeddings, 
+      but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail. 
+      You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again.
+    - If a collection name is not specified, Unstructured creates a new collection in your keyspace. If Unstructured generates embeddings, 
+      the new collection's name will be `u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>`. 
+      If Unstructured does not generate embeddings, the new collection's name will be `u<short-workflow-id>`.
 
   For [Unstructured Ingest](/open-source/ingestion/overview):
 
-  - If an existing collection name is specified, and Unstructured generates embeddings, 
-    but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail. 
-    You must change your Unstructured embedding settings or your existing collections's embedding settings to match, and try the run again. 
-  - If a collection name is not specified, Unstructured creates a new collection in your Pinecone account. The new collection's name will be `unstructuredautocreated`.
+  - For the source connector only, an existing collection is required.
+  - For the destination connector only, an existing collection is not required. At runtime, the collection behavior is as follows:
+  
+    - If an existing collection name is specified, and Unstructured generates embeddings, 
+      but the number of dimensions that are generated does not match the existing collection's embedding settings, the run will fail. 
+      You must change your Unstructured embedding settings or your existing collection's embedding settings to match, and try the run again. 
+    - If a collection name is not specified, Unstructured creates a new collection in your keyspace. The new collection's name will be `unstructuredautocreated`.
+
+  To create a collection yourself:
+
+  a. After you sign in to DataStax, in the list of databases, click the name of the target database.<br/>
+  b. On the **Data Explorer** tab, in the **Keyspace** list, select the name of the target keyspace.<br/>
+  c. In the **Collections** list, select **Create collection**.<br/>
+  d. Enter some **Collection name**.<br/>
+  e. Turn on **Vector-enabled collection**, if it is not already turned on.<br/>
+  f. For **Embedding generation method**, select **Bring my own**.<br/>
+  g. For **Dimensions**, enter the number of dimensions for the embedding model that you plan to use.<br/>
+  h. For **Similarity metric**, select **Cosine**.<br/>
+  i. Click **Create collection**.<br/>
+
+  [Learn more](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection).
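
The collection naming rules added above for the Unstructured UI and API can be illustrated with a small sketch. It only mirrors the documented patterns `u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>` and `u<short-workflow-id>`. The function names and the sample values are hypothetical, and how Unstructured derives the short workflow ID and short model name is not specified here.

```python
from typing import Optional


def expected_collection_name(
    short_workflow_id: str,
    short_model_name: Optional[str] = None,
    dimensions: Optional[int] = None,
) -> str:
    """Build a collection name following the documented naming pattern.

    Hypothetical helper for illustration only; it is not part of any
    Unstructured or Astra DB library.
    """
    if short_model_name is None or dimensions is None:
        # No embeddings are generated: u<short-workflow-id>
        return f"u{short_workflow_id}"
    # Embeddings are generated:
    # u<short-workflow-id>_<short-embedding-model-name>_<number-of-dimensions>
    return f"u{short_workflow_id}_{short_model_name}_{dimensions}"


def dimensions_match(generated_dimensions: int, collection_dimensions: int) -> bool:
    """Return True when the embedding size matches the existing collection.

    Per the text above, a mismatch causes the run to fail.
    """
    return generated_dimensions == collection_dimensions


# Sample values are made up for the example.
print(expected_collection_name("abc123"))
print(expected_collection_name("abc123", "minilm", 384))
print(dimensions_match(384, 1536))
```

Checking the dimensions before a run is one way to catch the mismatch case described above before the workflow fails.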
diff --git a/snippets/general-shared-text/milvus.mdx b/snippets/general-shared-text/milvus.mdx
@@ -1,7 +1,61 @@
-- For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), only Milvus cloud-based instances (such as Zilliz Cloud, and Milvus on IBM watsonx.data) are supported.
+- For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), only Milvus cloud-based instances (such as Milvus on IBM watsonx.data, or Zilliz Cloud) are supported.
 - For [Unstructured Ingest](/open-source/ingestion/overview), Milvus local and cloud-based instances are supported.
 
-The following video shows how to fulfill the minimum set of requirements for Milvus cloud-based instances, demonstrating Milvus on IBM watsonx.data:
+- For Milvus on IBM watsonx.data, you will need:
+
+  <iframe
+  width="560"
+  height="315"
+  src="https://www.youtube.com/embed/hLCwoe2fCnc"
+  title="YouTube video player"
+  frameborder="0"
+  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+  allowfullscreen
+  ></iframe>
+
+  - An [IBM Cloud account](https://cloud.ibm.com/registration).
+  - An IBM watsonx.data [Lite plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-tutorial_prov_lite_1) 
+    or [Enterprise plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started_1) within your IBM Cloud account.
+
+    - If you are provisioning a Lite plan, be sure to choose the **Generative AI** use case when prompted, as this is the only use case offered that includes Milvus.
+
+  - A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service).
+
+    - If you are creating a Milvus service instance within a watsonx.data Lite plan, when you are prompted to choose a Milvus instance size, you can only select **Lite**. Because the Lite 
+      Milvus instance size is recommended only for 384 dimensions, you should also use an embedding model that uses 384 dimensions only.
+    - If you are creating a Milvus service instance within a watsonx.data Enterprise plan, you can choose any available Milvus instance size. However, all Milvus instance sizes other than 
+      **Custom** are recommended only for 384 dimensions, which means you should use an embedding model that uses 384 dimensions only. 
+      The **Custom** Milvus instance size is recommended for any number of dimensions.
+
+  - The URI of the instance, which consists of `https://`, followed by the instance's **GRPC host**, followed by a colon and the **GRPC port**. 
+    This takes the format of `https://<host>:<port>`. To get this information, do the following:
+
+    a. Sign in to your IBM Cloud account.<br/>
+    b. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the title bar.<br/>
+    c. Expand **Databases**, and then click the name of the target **watsonx.data** plan.<br/>
+    d. Click **Open web console**.<br/>
+    e. On the sidebar, click **Infrastructure manager**. If the sidebar is not visible, click the **Global navigation** icon to the far left of the title bar.<br/>
+    f. Click the target Milvus service instance.<br/>
+    g. On the **Details** tab, under **Type**, click **View connect details**.<br/>
+    h. Under **Service details**, expand **GRPC**, and note the value of **GRPC host** and **GRPC port**.<br/>
+
+  - The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
+  - The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section.
+  - The username and password to access the instance. 
+
+    - The username for Milvus on IBM watsonx.data is always `ibmlhapikey`. 
+    - The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key. To create an IBM Cloud user API key:
+
+      a. Sign in to your IBM Cloud account.<br/>
+      b. In the title bar, click **Manage** and then, under **Security and access**, click **Access (IAM)**.<br/>
+      c. On the sidebar, under **Manage identities**, click **API keys**. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the title bar.<br/>
+      d. Click **Create**.<br/>
+      e. Enter some **Name** for the API key.<br/>
+      f. Optionally, enter some **Description** for the API key.<br/>
+      g. For **Leaked action**, leave **Disable the leaked key** selected.<br/>
+      h. For **Session management**, leave **No** selected.<br/>
+      i. Click **Create**.<br/>
+      j. Click **Download** (or **Copy**), and then download the API key to a secure location (or paste the copied API key into a secure location). You won't be able to access this API key from this dialog again. If you lose this API key, you can create a new one (and you should then delete the old one).<br/>
 
 - For Zilliz Cloud, you will need:
 
@@ -54,31 +108,6 @@ The following video shows how to fulfill the minimum set of requirements for Mil
         The number of dimensions for the `embeddings` field must match the number of dimensions for the embedding model that you plan to use.
     </Warning>
 
-- For Milvus on IBM watsonx.data, you will need:
-
-  <iframe
-  width="560"
-  height="315"
-  src="https://www.youtube.com/embed/hLCwoe2fCnc"
-  title="YouTube video player"
-  frameborder="0"
-  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
-  allowfullscreen
-  ></iframe>
-
-  - An [IBM Cloud account](https://cloud.ibm.com/registration).
-  - The [IBM watsonx.data subscription plan](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-getting-started).
-  - A [Milvus service instance in IBM watsonx.data](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-adding-milvus-service).
-  - The URI of the instance, which takes the format of `https://`, followed by instance's **GRPC host**, followed by a colon and the **GRPC port**. 
-    This takes the format of `https://<host>:<port>`. 
-    [Get the instance's GRPC host and GRPC port](https://cloud.ibm.com/docs/watsonxdata?topic=watsonxdata-conn-to-milvus).
-  - The name of the [database](https://milvus.io/docs/manage_databases.md) in the instance.
-  - The name of the [collection](https://milvus.io/docs/manage-collections.md) in the database. Note the collection requirements at the end of this section.
-  - The username and password to access the instance. 
-    The username for Milvus on IBM watsonx.data is always `ibmlhapikey`. 
-    The password for Milvus on IBM watsonx.data is in the form of an IBM Cloud user API key. 
-    [Get the user API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui).
-
 - For Milvus local, you will need:
 
   - A [Milvus instance](https://milvus.io/docs/install-overview.md).
@@ -89,7 +118,9 @@ The following video shows how to fulfill the minimum set of requirements for Mil
   - The [username and password, or token](https://milvus.io/docs/authenticate.md) to access the instance.
 
 All Milvus instances require the target collection to have a defined schema before Unstructured can write to the collection. The minimum viable 
-schema for Unstructured contains only the fields `element_id`, `embeddings`, `record_id`, and `text`, as follows. This example code demonstrates the use of the 
+schema for Unstructured contains only the fields `element_id`, `embeddings`, `record_id`, and `text`, as follows. 
+
+This example code demonstrates the use of the 
 [Python SDK for Milvus](https://pypi.org/project/pymilvus/) to create a collection with this schema, 
 targeting Milvus on IBM watsonx.data. For the `MilvusClient` arguments to connect to other types of Milvus deployments, see your Milvus provider's documentation:
 
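
Separately from the schema example that follows in the file, the connection values described in this snippet for Milvus on IBM watsonx.data can be assembled as in the sketch below. It combines the **GRPC host** and **GRPC port** into the `https://<host>:<port>` form and pairs the fixed username `ibmlhapikey` with an IBM Cloud user API key. The helper names, dictionary keys, and sample values are assumptions made for illustration.

```python
IBM_MILVUS_USERNAME = "ibmlhapikey"  # fixed username, per the text above


def milvus_uri(grpc_host: str, grpc_port: int) -> str:
    """Combine the GRPC host and port into https://<host>:<port>."""
    host = grpc_host.strip()
    if not host:
        raise ValueError("GRPC host must not be empty")
    if not 0 < grpc_port < 65536:
        raise ValueError("GRPC port must be between 1 and 65535")
    return f"https://{host}:{grpc_port}"


def connection_settings(
    grpc_host: str, grpc_port: int, api_key: str, db_name: str
) -> dict:
    """Collect the values this snippet lists as required.

    The dictionary keys are illustrative; map them to whatever your
    client or connector configuration expects.
    """
    return {
        "uri": milvus_uri(grpc_host, grpc_port),
        "user": IBM_MILVUS_USERNAME,
        "password": api_key,  # the IBM Cloud user API key
        "db_name": db_name,
    }


# Host, port, key, and database name below are placeholders.
settings = connection_settings("example-host.example.com", 30439, "<api-key>", "default")
print(settings["uri"])
```

Keeping the API key out of source code (for example, reading it from an environment variable) is advisable, because the key cannot be retrieved again after it is created.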
diff --git a/ui/destinations/ibm-watsonxdata.mdx b/ui/destinations/ibm-watsonxdata.mdx
@@ -2,6 +2,17 @@
 title: IBM watsonx.data
 ---
 
+<Tip>
+    The IBM watsonx.data destination connector relies on an Apache Iceberg-based catalog within the watsonx.data data store instance. 
+    Apache Iceberg is suitable for managed data storage and cataloging, but not for embedding storage or semantic similarity 
+    queries. For embedding storage and semantic similarity queries, Unstructured recommends that you use the following destination connectors 
+    instead:
+
+    - [Astra DB](/ui/destinations/astradb)
+    - [Milvus](/ui/destinations/milvus) on IBM watsonx.data
+
+</Tip>
+
 import FirstTimeUIDestinationConnector from '/snippets/general-shared-text/first-time-ui-destination-connector.mdx';
 
 <FirstTimeUIDestinationConnector />