|
| 1 | +- An [IBM Cloud account](https://cloud.ibm.com/login). [Create an IBM Cloud account](https://cloud.ibm.com/registration) if you do not already have one. |
| 2 | +- An API key for the IBM Cloud account. If you do not have one already, create one as follows: |
| 3 | + |
| 4 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 5 | + 2. In the top navigation bar, click **Manage** and then, under **Security and access**, click **Access (IAM)**. |
| 6 | + 3. On the sidebar, under **Manage identities**, click **API keys**. |
| 7 | + 4. With the **View** list showing **My IBM Cloud API keys**, click **Create**. |
| 8 | + 5. Enter a **Name** and, optionally, a **Description** for the API key. |
| 9 | + 6. Leave **Leaked action** set to **Disable the leaked key** and **Session creation** set to **No**. |
| 10 | + 7. Click **Create**. |
| 11 | + 8. Click **Copy** or **Download** to copy or save the API key to a secure location. You won't be able to access this API key from this screen again. |
| 12 | + |
| 13 | +- An IBM Cloud Object Storage (COS) instance in the account, and a bucket within that instance. If you do not have them already, |
| 14 | + create them as follows: |
| 15 | + |
| 16 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 17 | + 2. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the |
| 18 | + top navigation bar. |
| 19 | + 3. Click **Create resource**. |
| 20 | + 4. With **IBM Cloud catalog** selected, search for and select **Object Storage**. |
| 21 | + 5. Complete the on-screen instructions to finish creating the COS instance. |
| 22 | + 6. With the COS instance's settings page shown, on the **Buckets** tab, click **Create bucket**. |
| 23 | + 7. Complete the on-screen instructions to finish creating the bucket. |
| 24 | + |
| 25 | +- The name, region, and public endpoint for the target bucket within the target Cloud Object Storage (COS) instance. To get these: |
| 26 | + |
| 27 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 28 | + 2. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the |
| 29 | + top navigation bar. |
| 30 | + 3. In the list of resources, expand **Storage**, and then click the target COS instance. |
| 31 | + 4. On the **Buckets** tab, click the target bucket. |
| 32 | + 5. On the **Configuration** tab, note the following: |
| 33 | + |
| 34 | + - Under **Bucket details**, note the **Bucket name**. This is the bucket's name. |
| 35 | + - Under **Bucket details**, note the value inside the parentheses in **Location**, for example `us-east`. This is the bucket's region. |
| 36 | + - Under **Endpoints**, note the value of **Public**, for example `s3.us-east.cloud-object-storage.appdomain.cloud`. This is the bucket's public endpoint. (Ignore the values of |
| 37 | + **Private** and **Direct**.) |
| 38 | + |
| 39 | +- An HMAC access key ID and secret access key for the target Cloud Object Storage (COS) instance. If you do not have them already, |
| 40 | + get or create them as follows: |
| 41 | + |
| 42 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 43 | + 2. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the |
| 44 | + top navigation bar. |
| 45 | + 3. In the list of resources, expand **Storage**, and then click the target COS instance. |
| 46 | + 4. On the **Service credentials** tab, if there is a credential that you want to use in the list, expand the credential, and copy the following values to a secure location: |
| 47 | + |
| 48 | + - `access_key_id` under `cos_hmac_keys`, which represents the HMAC access key ID. |
| 49 | + - `secret_access_key` under `cos_hmac_keys`, which represents the HMAC secret access key. |
| 50 | + |
| 51 | + After you have copied the preceding values, you have completed this procedure. |
| 52 | + |
| 53 | + 5. If there is not a credential that you want to use, or there are no credentials at all, click **New Credential**. |
| 54 | + 6. Enter a **Name** for the credential. |
| 55 | + 7. For **Role**, select at least **Writer**, leave **Select Service ID** set to **Auto Generated**, |
| 56 | + switch on **Include HMAC Credential**, and then click **Add**. |
| 57 | + 8. In the list of credentials, expand the credential, and copy the following values to a secure location: |
| 58 | + |
| 59 | + - `access_key_id` under `cos_hmac_keys`, which represents the HMAC access key ID. |
| 60 | + - `secret_access_key` under `cos_hmac_keys`, which represents the HMAC secret access key. |
| 61 | + |
| 62 | +- An IBM watsonx.data data store instance in the IBM Cloud account. If you do not have one already, create one as follows: |
| 63 | + |
| 64 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 65 | + 2. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the |
| 66 | + top navigation bar. |
| 67 | + 3. Click **Create resource**. |
| 68 | + 4. With **IBM Cloud catalog** selected, search for and select **watsonx.data**. |
| 69 | + 5. Complete the on-screen instructions to finish creating the watsonx.data data store instance. |
| 70 | + |
| 71 | +- An Apache Iceberg-based catalog within the watsonx.data data store instance. If you do not have one already, create one as follows: |
| 72 | + |
| 73 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 74 | + 2. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the |
| 75 | + top navigation bar. |
| 76 | + 3. In the list of resources, expand **Databases**, and then click the target watsonx.data data store instance. |
| 77 | + 4. Click **Open web console**. |
| 78 | + 5. If prompted, log in to the web console. |
| 79 | + 6. On the sidebar, click **Infrastructure manager**. If the sidebar is not visible, click the **Global navigation** icon to the far left of the |
| 80 | + top navigation bar. |
| 81 | + 7. Click **Add component**. |
| 82 | + 8. Under **Storage**, click **IBM Cloud Object Storage**, and then click **Next**. |
| 83 | + 9. Complete the on-screen instructions to finish creating the Iceberg catalog. This includes providing the following settings: |
| 84 | + |
| 85 | + - A display name for the component. |
| 86 | + - The name of the target bucket within the target Cloud Object Storage (COS) instance that you noted earlier. |
| 87 | + - The region for the target bucket, which you noted earlier. |
| 88 | + - The public endpoint for the target bucket, which you noted earlier. For this screen only, be sure to prefix the public endpoint with `https://`. |
| 89 | + - The HMAC access key ID for the target COS instance, which you noted earlier. |
| 90 | + - The HMAC secret access key for the target COS instance, which you noted earlier. |
| 91 | + |
| 92 | + 10. Next to **Connection status**, click **Test connection** to test the connection. Do not proceed until **Successful** is shown. If the connection is |
| 93 | + not successful, check the values you entered for the target bucket name, region, public endpoint, HMAC access key ID, and HMAC secret access key, and try again. |
| 94 | + 11. Check the box labelled **Associate Catalog**. |
| 95 | + 12. Check the box labelled **Activate now**. |
| 96 | + 13. Under **Associated catalog**, for **Catalog type**, select **Apache Iceberg**. |
| 97 | + 14. Enter a **Catalog name**. |
| 98 | + 15. Click **Associate**. |
| 99 | + 16. On the sidebar, click **Infrastructure manager**. Make sure the catalog is associated with the appropriate engines. If it is not, rest your mouse |
| 100 | + on an unassociated target engine, click the **Manage associations** icon, check the box next to the target catalog's name, and then |
| 101 | + click **Save and restart engine**. |
| 102 | + |
| 103 | + To create an engine if one is not already shown, click **Add component**, and follow the on-screen instructions to add an appropriate engine from the list of available **Engines** |
| 104 | + (for example, an **IBM Presto** engine). |
| 105 | + |
| 106 | +- The catalog name and metastore REST endpoint for the target Iceberg catalog. To get these: |
| 107 | + |
| 108 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 109 | + 2. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the |
| 110 | + top navigation bar. |
| 111 | + 3. In the list of resources, expand **Databases**, and then click the target watsonx.data data store instance. |
| 112 | + 4. Click **Open web console**. |
| 113 | + 5. If prompted, log in to the web console. |
| 114 | + 6. On the sidebar, click **Infrastructure manager**. If the sidebar is not visible, click the **Global navigation** icon to the far left of the |
| 115 | + top navigation bar. |
| 116 | + 7. In the **Catalogs** section, click the target Iceberg catalog. |
| 117 | + 8. On the **Details** tab, note the value of **Name**, which is the catalog name, and the value of **Metastore REST endpoint**, which is the metastore REST endpoint. (Ignore the **Metastore Thrift endpoint** value.) |
| 118 | + |
| 119 | +- A namespace (also known as a schema) and a table in the target catalog. If you do not have these already, create them as follows: |
| 120 | + |
| 121 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 122 | + 2. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the |
| 123 | + top navigation bar. |
| 124 | + 3. In the list of resources, expand **Databases**, and then click the target watsonx.data data store instance. |
| 125 | + 4. Click **Open web console**. |
| 126 | + 5. If prompted, log in to the web console. |
| 127 | + 6. On the sidebar, click **Data manager**. If the sidebar is not visible, click the **Global navigation** icon to the far left of the |
| 128 | + top navigation bar. |
| 129 | + 7. On the **Browse data** tab, under **Catalogs associated**, click the target catalog. |
| 130 | + 8. Click the ellipsis, and then click **Create schema**. |
| 131 | + 9. Enter a **Name** for the schema, and then click **Create**. |
| 132 | + 10. On the sidebar, click **Query workspace**. |
| 133 | + 11. In the SQL editor, enter and run a table creation statement such as the following, replacing `<catalog-name>` with the name of the target |
| 134 | + catalog and `<schema-name>` with the name of the target schema: |
| 135 | + |
| 136 | + ```sql |
| 137 | + CREATE TABLE <catalog-name>.<schema-name>.elements ( |
| 138 | + "type" varchar, |
| 139 | + "element_id" varchar, |
| 140 | + "text" varchar, |
| 141 | + "file_directory" varchar, |
| 142 | + "filename" varchar, |
| 143 | + "languages" array(varchar), |
| 144 | + "last_modified" double, |
| 145 | + "page_number" varchar, |
| 146 | + "filetype" varchar, |
| 147 | + "url" varchar, |
| 148 | + "version" varchar, |
| 149 | + "record_locator" varchar, |
| 150 | + "date_created" double, |
| 151 | + "date_modified" double, |
| 152 | + "date_processed" double, |
| 153 | + "filesize_bytes" bigint, |
| 154 | + "points" varchar, |
| 155 | + "system" varchar, |
| 156 | + "layout_width" bigint, |
| 157 | + "layout_height" bigint, |
| 158 | + "id" varchar, |
| 159 | + "record_id" varchar, |
| 160 | + "parent_id" varchar |
| 161 | + ) |
| 162 | + WITH ( |
| 163 | + delete_mode = 'copy-on-write', |
| 164 | + format = 'PARQUET', |
| 165 | + format_version = '2' |
| 166 | + ) |
| 167 | + ``` |
| 168 | + |
| 169 | + Note that incoming elements that do not have matching column |
| 170 | + names will be dropped upon record insertion. For example, if the incoming data has an element named `sent_from` and there is no |
| 171 | + column named `sent_from` in the table, the `sent_from` element will be dropped upon record insertion. You should modify the preceding |
| 172 | + sample table creation statement to add columns for any additional elements that you want to be included upon record |
| 173 | + insertion. |
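
If the target table already exists, a column for an additional element can be added without recreating the table. The following is a minimal sketch, assuming the associated engine is Presto with Iceberg support (where `ALTER TABLE ... ADD COLUMN` is available). The `sent_from` column comes from the example above, and its `varchar` type is an assumption that should be adjusted to match the incoming data:

```sql
-- Sketch only: adds a column for the example `sent_from` element.
-- The varchar type is an assumption; choose the type that matches your data.
ALTER TABLE <catalog-name>.<schema-name>.elements
ADD COLUMN "sent_from" varchar
```

As with the table creation statement, replace `<catalog-name>` and `<schema-name>` before running the statement in the SQL editor.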
| 174 | + |
| 175 | +- The name of the target namespace (also known as a schema) within the target catalog, and the name of the target table within that schema. To get these: |
| 176 | + |
| 177 | + 1. [Log in to your IBM Cloud account](https://cloud.ibm.com/login). |
| 178 | + 2. On the sidebar, click the **Resource list** icon. If the sidebar is not visible, click the **Navigation Menu** icon to the far left of the |
| 179 | + top navigation bar. |
| 180 | + 3. In the list of resources, expand **Databases**, and then click the target watsonx.data data store instance. |
| 181 | + 4. Click **Open web console**. |
| 182 | + 5. If prompted, log in to the web console. |
| 183 | + 6. On the sidebar, click **Data manager**. If the sidebar is not visible, click the **Global navigation** icon to the far left of the |
| 184 | + top navigation bar. |
| 185 | + 7. On the **Browse data** tab, expand the name of the target catalog, and note the names of the target schema and target table. |
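
The schema and table names can also be listed from the **Query workspace** SQL editor. This is a minimal sketch, assuming a Presto engine is associated with the target catalog. Run each statement separately:

```sql
-- List the schemas (namespaces) in the target catalog.
SHOW SCHEMAS FROM <catalog-name>

-- List the tables in the target schema.
SHOW TABLES FROM <catalog-name>.<schema-name>
```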
| 186 | + |
| 187 | +- The name of the column in the target table that uniquely identifies each of the records in the table. |
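
One way to check whether a candidate column is unique across the rows already in the table is a duplicate-count query. This is a sketch, assuming a Presto engine. `<table-name>` and `<column-name>` are placeholders introduced here for the target table and the candidate column:

```sql
-- Returns one row per value that appears more than once in the candidate column.
-- An empty result means no duplicates exist among the current rows.
SELECT "<column-name>", count(*) AS occurrences
FROM <catalog-name>.<schema-name>.<table-name>
GROUP BY "<column-name>"
HAVING count(*) > 1
```

An empty result describes only the rows that exist when the query runs. The query does not enforce uniqueness for rows inserted later.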