Commit 175ec8e

Ingest/UI/API: Delta Tables in Databricks destination connector - automatic table creation (#567)
1 parent ad44446 commit 175ec8e

4 files changed: +52 -53 lines changed

snippets/general-shared-text/databricks-delta-table-api-placeholders.mdx

Lines changed: 7 additions & 1 deletion
@@ -8,7 +8,13 @@

 If the target table and volume are in the same schema (formerly known as a database), then `<database>` and `<schema>` will have the same values.

-- `<table_name>` (_required_): The name of the target table in Unity Catalog.
+- `<table_name>`: The name of the target table in Unity Catalog.
+
+- If a table name is specified, but a table with that name does not exist within the specified schema (formerly known as a database), the connector attempts to create a table with that name within that schema.
+- If no table name is specified, the connector attempts to create a table named `u<short-workflow-id>` within the specified schema (formerly known as a database).
+
+See the beginning of this article for additional technical requirements before having the connector attempt to create a table.
+
 - `<schema>`: The name of the schema (formerly known as a database) in Unity Catalog for the target volume. The default is `default` if not otherwise specified.

 If the target volume and table are in the same schema (formerly known as a database), then `<schema>` and `<database>` will have the same values.
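
The table-name rules added in this snippet can be read as a small decision procedure. The following Python sketch is illustrative only and is not the connector's code: `plan_table_action`, its parameters, and the `"use_existing"`/`"create"` labels are hypothetical names, and treating an empty string the same as an unspecified name is an assumption.

```python
from typing import Optional, Set, Tuple


def plan_table_action(
    table_name: Optional[str],
    short_workflow_id: str,
    existing_tables: Set[str],
) -> Tuple[str, str]:
    """Return (target_table, action) following the documented rules.

    Hypothetical helper: the real connector's internals are not shown
    in this commit.
    """
    # No table name given: fall back to the documented `u<short-workflow-id>` name.
    # Assumption: an empty string counts as "not specified".
    target = table_name if table_name else f"u{short_workflow_id}"

    # A table that is not already in the schema is one the connector
    # would attempt to create.
    action = "use_existing" if target in existing_tables else "create"
    return target, action


if __name__ == "__main__":
    print(plan_table_action("elements", "abc123", {"elements"}))
    print(plan_table_action(None, "abc123", {"elements"}))
```

The sketch only decides which name is targeted and whether creation would be attempted. Whether creation succeeds still depends on the privileges described in the article's requirements.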

snippets/general-shared-text/databricks-delta-table-cli-api.mdx

Lines changed: 6 additions & 1 deletion
@@ -20,7 +20,12 @@ The following environment variables:

 If you are also using a volume, and the target table and volume are in the same schema (formerly known as a database), then `DATABRICKS_DATABASE` and `DATABRICKS_SCHEMA` will have the same values.

-- `DATABRICKS_TABLE` - The name of the table inside of the schema (formerly known as a database), represented by `--table-name` (CLI) or `table_name` (Python). The default is `elements` if not otherwise specified.
+- `DATABRICKS_TABLE` - The name of the table inside of the schema (formerly known as a database), represented by `--table-name` (CLI) or `table_name` (Python).
+
+- If a table name is specified, but a table with that name does not exist within the specified schema (formerly known as a database), the connector attempts to create a table with that name within that schema.
+- If no table name is specified, the connector attempts to create a table named `u<short-workflow-id>` within the specified schema (formerly known as a database).
+
+See the beginning of this article for additional technical requirements before having the connector attempt to create a table.

 <Note>
 Using dashes (`-`) in the names of catalogs, schemas (formerly known as databases), tables, and volumes might cause isolated issues with the connector. It is

snippets/general-shared-text/databricks-delta-table-platform.mdx

Lines changed: 7 additions & 1 deletion
@@ -10,7 +10,13 @@ Fill in the following fields:

 If the target table and volume are in the same schema (formerly known as a database), then **Database** and **Schema** will have the same names.

-- **Table Name** (_required_): The name of the target table in Unity Catalog.
+- **Table Name**: The name of the target table in Unity Catalog.
+
+- If a table name is specified, but a table with that name does not exist within the specified schema (formerly known as a database), the connector attempts to create a table with that name within that schema.
+- If no table name is specified, the connector attempts to create a table named `u<short-workflow-id>` within the specified schema (formerly known as a database).
+
+See the beginning of this article for additional technical requirements before having the connector attempt to create a table.
+
 - **Schema**: The name of the schema (formerly known as a database) in Unity Catalog for the target volume. The default is `default` if not otherwise specified.

 If the target volume and table are in the same schema (formerly known as a database), then **Schema** and **Database** will have the same names.

snippets/general-shared-text/databricks-delta-table.mdx

Lines changed: 32 additions & 50 deletions
@@ -61,10 +61,28 @@
 [GCP](https://docs.gcp.databricks.com/tables/managed.html)
 within that schema (formerly known as a database).

-<Note>
-Using dashes (`-`) in the names of catalogs, schemas (formerly known as databases), and tables might cause isolated issues with the connector. It is
-recommended to use underscores (`_`) instead of dashes in the names of catalogs, schemas, and tables.
-</Note>
+You can have the connector attempt to create a table for you automatically at run time. To do this, in the connector settings as described later in this article,
+do one of the following:
+
+- Specify the name of the table that you want the connector to attempt to create within the specified catalog and schema (formerly known as a database).
+- Leave the table name blank. The connector will attempt to create a table within the specified catalog and schema (formerly known as a database).
+For the [Unstructured UI](/ui/overview) and [Unstructured API](/api-reference/overview), the table is named `u<short-workflow-id>`.
+For the [Unstructured Ingest CLI and Ingest Python library](/ingestion/overview), the table is named `unstructuredautocreated`.
+
+The connector will attempt to create the table on behalf of the related Databricks workspace user or Databricks managed service principal that is referenced in the connector settings, as described later in these requirements.
+If successful, the table's owner is set as the related Databricks workspace user or Databricks managed service principal. The owner will have all Unity Catalog
+privileges on the table by default. No other Databricks workspace users or Databricks managed service principals will have any privileges on the table by default.
+
+<Warning>
+If the table's parent schema (formerly known as a database) is not owned by the same Databricks workspace user or Databricks managed service principal that is
+referenced in the connector settings, then you should grant the new table's owner the `CREATE TABLE` privilege on that parent schema (formerly known as a database)
+before the connector attempts to create the table. Otherwise, table creation could fail.
+</Warning>
+
+<Note>
+Using dashes (`-`) in the names of catalogs, schemas (formerly known as databases), and tables might cause isolated issues with the connector. It is
+recommended to use underscores (`_`) instead of dashes in the names of catalogs, schemas, and tables.
+</Note>

 The following video shows how to create a catalog, schema (formerly known as a database), and a table in Unity Catalog if you do not already have them available, and set privileges for someone other than their owner to use them:

@@ -78,56 +96,17 @@
 allowfullscreen
 ></iframe>

-This table must contain the following column names and their data types:
+If you want to use an existing table or create one yourself beforehand, this table must contain at minimum the following column names and their data types:

 ```text
-CREATE TABLE IF NOT EXISTS `<catalog_name>`.`<schema_name>`.elements (
+CREATE TABLE IF NOT EXISTS <catalog_name>.<schema_name>.<table_name> (
 id STRING NOT NULL PRIMARY KEY,
-record_id STRING,
-element_id STRING,
+record_id STRING NOT NULL,
+element_id STRING NOT NULL,
 text STRING,
 embeddings ARRAY<FLOAT>,
 type STRING,
-date_created TIMESTAMP,
-date_modified TIMESTAMP,
-date_processed TIMESTAMP,
-permissions_data STRING,
-filesize_bytes FLOAT,
-url STRING,
-version STRING,
-record_locator STRING,
-category_depth DOUBLE,
-parent_id STRING,
-attached_filename STRING,
-filetype STRING,
-last_modified TIMESTAMP,
-file_directory STRING,
-filename STRING,
-languages ARRAY<STRING>,
-page_number STRING,
-links STRING,
-page_name STRING,
-link_urls STRING,
-link_texts STRING,
-sent_from STRING,
-sent_to STRING,
-subject STRING,
-section STRING,
-header_footer_type STRING,
-emphasized_text_contents STRING,
-emphasized_text_tags STRING,
-text_as_html STRING,
-regex_metadata STRING,
-detection_class_prob FLOAT,
-is_continuation BOOLEAN,
-orig_elements STRING,
-coordinates_points STRING,
-coordinates_system STRING,
-coordinates_layout_width FLOAT,
-coordinates_layout_height FLOAT,
-partitioner_type STRING,
-image_mime_type STRING,
-image_base64 STRING
+metadata VARIANT
 );
 ```
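
For an existing table, the minimum column set in the added lines above could be checked before a run. The Python sketch below is one way to do that, under stated assumptions: `REQUIRED_COLUMNS` and `find_schema_problems` are hypothetical names, the caller supplies the table's columns as a name-to-type mapping (gathered by whatever means is available), and nullability and the primary key constraint are not checked.

```python
from typing import Dict, List

# Minimum columns and types taken from the CREATE TABLE statement in this diff.
REQUIRED_COLUMNS: Dict[str, str] = {
    "id": "STRING",
    "record_id": "STRING",
    "element_id": "STRING",
    "text": "STRING",
    "embeddings": "ARRAY<FLOAT>",
    "type": "STRING",
    "metadata": "VARIANT",
}


def find_schema_problems(actual_columns: Dict[str, str]) -> List[str]:
    """List missing or mistyped required columns (hypothetical helper).

    Type names are compared case-insensitively with spaces removed.
    Extra columns in the table are ignored, since the documented set
    is a minimum.
    """
    problems: List[str] = []
    for name, expected in REQUIRED_COLUMNS.items():
        actual = actual_columns.get(name)
        if actual is None:
            problems.append(f"missing column: {name}")
        elif actual.replace(" ", "").upper() != expected:
            problems.append(f"column {name}: expected {expected}, found {actual}")
    return problems


if __name__ == "__main__":
    print(find_schema_problems({"id": "string", "text": "string"}))
```

An empty result means every documented minimum column is present with the expected type.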
@@ -208,7 +187,9 @@
 ></iframe>

 - The Databricks workspace user or Databricks managed service principal must have the following _minimum_ set of permissions and privileges to write to an
-existing volume or table in Unity Catalog:
+existing volume or table in Unity Catalog. If the owner of these is that Databricks workspace user or Databricks managed service principal, then
+they will have all necessary permissions and privileges by default. If the owner is someone else, then the following permissions and privileges must be
+explicitly granted to them before using the connector:

 - To use an all-purpose cluster for access, `Can Restart` permission on that cluster. Learn how to check and set cluster permissions for
 [AWS](https://docs.databricks.com/compute/clusters-manage.html#compute-permissions),
@@ -233,7 +214,8 @@

 - `USE CATALOG` on the table's parent catalog in Unity Catalog.
 - `USE SCHEMA` on the tables's parent schema (formerly known as a database) in Unity Catalog.
-- `MODIFY` and `SELECT` on the table.
+- To create a new table, `CREATE TABLE` on the table's parent schema (formerly known as a database) in Unity Catalog.
+- If the table already exists, `MODIFY` and `SELECT` on the table.

 Learn how to check and set Unity Catalog privileges for
 [AWS](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges),
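
The table privileges listed in this hunk map onto Unity Catalog `GRANT` statements. The Python sketch below assembles them as strings for a non-owner principal. It is a hedged illustration: `build_grant_statements` and its parameters are hypothetical, identifiers are not escaped, and the generated SQL should be checked against the Databricks documentation before it is run.

```python
from typing import List, Optional


def build_grant_statements(
    principal: str,
    catalog: str,
    schema: str,
    table: Optional[str] = None,
) -> List[str]:
    """Build GRANT statements for the table privileges listed above.

    Hypothetical helper. Pass `table=None` when the connector is
    expected to create the table; pass a name when the table exists.
    """
    statements = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
    ]
    if table is None:
        # New table: the principal needs CREATE TABLE on the parent schema.
        statements.append(
            f"GRANT CREATE TABLE ON SCHEMA {catalog}.{schema} TO `{principal}`;"
        )
    else:
        # Existing table: the principal needs MODIFY and SELECT on it.
        statements.append(
            f"GRANT MODIFY, SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`;"
        )
    return statements


if __name__ == "__main__":
    for statement in build_grant_statements("user@example.com", "main", "default"):
        print(statement)
```

The sketch covers only the Unity Catalog table privileges. Cluster or SQL warehouse permissions and volume privileges are granted separately.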
