docs/pages/product/configuration/data-sources/databricks-jdbc.mdx (13 additions, 0 deletions)

@@ -147,6 +147,17 @@ CUBEJS_DB_EXPORT_BUCKET_AZURE_CLIENT_ID=<AZURE_CLIENT_ID>
CUBEJS_DB_EXPORT_BUCKET_AZURE_CLIENT_SECRET=<AZURE_CLIENT_SECRET>
```

#### Google Cloud Storage

To use Google Cloud Storage as an export bucket, follow [the Databricks guide on
connecting to Google Cloud Storage][databricks-docs-uc-gcs].

```dotenv
CUBEJS_DB_EXPORT_BUCKET=gs://my-bucket-on-gcs
CUBEJS_DB_EXPORT_BUCKET_TYPE=gcs
CUBEJS_DB_EXPORT_GCS_CREDENTIALS=<BASE64_ENCODED_SERVICE_CREDENTIALS_JSON>
```
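The `<BASE64_ENCODED_SERVICE_CREDENTIALS_JSON>` placeholder is the service-account key JSON, base64-encoded. A minimal Node.js sketch for producing that value (the key file path is a placeholder, not part of this change):

```typescript
// Sketch: base64-encode a downloaded GCP service-account key so it can be
// used as CUBEJS_DB_EXPORT_GCS_CREDENTIALS. The file path is an assumption.
import { readFileSync } from 'fs';

const keyJson = readFileSync('./service-account.json', 'utf8');
console.log(Buffer.from(keyJson, 'utf8').toString('base64'));
```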

## SSL/TLS

Cube does not require any additional configuration to enable SSL/TLS for
@@ -173,6 +184,8 @@ bucket][self-preaggs-export-bucket] **must be** configured.
https://docs.databricks.com/data/data-sources/azure/azure-storage.html
[databricks-docs-uc-s3]:
https://docs.databricks.com/en/connect/unity-catalog/index.html
[databricks-docs-uc-gcs]:
https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage.html
[databricks-docs-jdbc-url]:
https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url
[databricks-docs-pat]:
packages/cubejs-databricks-jdbc-driver/src/DatabricksDriver.ts (23 additions, 2 deletions)

@@ -103,6 +103,11 @@ export type DatabricksDriverConfiguration = JDBCDriverConfiguration &
* Azure service principal client secret
*/
azureClientSecret?: string,

/**
* GCS credentials JSON content
*/
gcsCredentials?: string,
};

type ShowTableRow = {
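As a side note, the same settings can presumably also be passed programmatically through the driver configuration rather than via environment variables; a hedged sketch (other constructor options omitted):

```typescript
// Sketch: configuring a GCS export bucket directly on the driver, mirroring
// the CUBEJS_DB_EXPORT_* environment variables handled in the hunk below.
const driver = new DatabricksDriver({
  bucketType: 'gcs',
  exportBucket: 'gs://my-bucket-on-gcs',
  gcsCredentials: process.env.CUBEJS_DB_EXPORT_GCS_CREDENTIALS,
});
```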
@@ -209,7 +214,7 @@ export class DatabricksDriver extends JDBCDriver {
// common export bucket config
bucketType:
conf?.bucketType ||
-      getEnv('dbExportBucketType', { supported: ['s3', 'azure'], dataSource }),
+      getEnv('dbExportBucketType', { supported: ['s3', 'azure', 'gcs'], dataSource }),
exportBucket:
conf?.exportBucket ||
getEnv('dbExportBucket', { dataSource }),
@@ -246,6 +251,10 @@
azureClientSecret:
conf?.azureClientSecret ||
getEnv('dbExportBucketAzureClientSecret', { dataSource }),
// GCS credentials
gcsCredentials:
conf?.gcsCredentials ||
getEnv('dbExportGCSCredentials', { dataSource }),
};
if (config.readOnly === undefined) {
// we can set readonly to true if there is no bucket config provided
@@ -643,7 +652,7 @@
* export bucket data.
*/
public async unload(tableName: string, options: UnloadOptions) {
-    if (!['azure', 's3'].includes(this.config.bucketType as string)) {
+    if (!['azure', 's3', 'gcs'].includes(this.config.bucketType as string)) {
throw new Error(`Unsupported export bucket type: ${
this.config.bucketType
}`);
@@ -733,6 +742,15 @@
url.host,
objectSearchPrefix,
);
} else if (this.config.bucketType === 'gcs') {
return this.extractFilesFromGCS(
{ credentials: this.config.gcsCredentials },
url.host,
objectSearchPrefix+".csv",
**Member:** Is it important to add `.csv` here? Are there any other unrelated files that might be captured by `objectSearchPrefix` that should be excluded?

**Member:** Having this:

```typescript
protected async extractFilesFromGCS(
  gcsConfig: GoogleStorageClientConfig,
  bucketName: string,
  tableName: string
): Promise<string[]> {
  // ...
  const [files] = await bucket.getFiles({ prefix: `${tableName}/` });
```

Does Databricks create a folder for a table with a `.csv` suffix? Like `my_exported_table.csv/` with a bunch of real CSV files inside?

**Contributor Author:** It looks like this:

```
gs://bucket-name/
└── table-name.csv/
    ├── _SUCCESS
    ├── _committed_4308966877412207793
    ├── _started_4308966877412207793
    ├── part-00000-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140315-1-c000.csv
    ├── part-00001-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140316-1-c000.csv
    ├── part-00002-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140317-1-c000.csv
    └── part-00003-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140318-1-c000.csv
```

It's kind of weird, actually, since the folder name has a `.csv` suffix in it, like `table-name.csv/`.

**Contributor Author:**

```sql
CREATE TABLE database.schema.table_name
USING CSV
LOCATION 'gs://bucket-name/table-name.csv'
OPTIONS (escape = '"') AS
(
  select ...
)
```

Maybe we don't need the `.csv` suffix in the `LOCATION`? But I don't have a Databricks-AWS environment to test whether the behavior is the same as when exporting to GCS.

**Member:** Yeah, I think you're right. We should remove the `.csv` postfix from the location.

).then(files => files.filter(file =>
decodeURIComponent(new URL(file).pathname).endsWith('.csv') ||
decodeURIComponent(new URL(file).pathname).endsWith('.csv.gz')
**Member** (on lines +753 to +754): The answers to the questions above might affect these lines. Are they needed?

**Contributor Author (@qiao-x, Apr 8, 2025):** Yes, they are needed, since those files will also be returned from `extractFilesFromGCS`, and I am not sure whether they would cause any problems on the Cube Store side (I observed those files in the Cube Store logs as well):

```
├── _SUCCESS
├── _committed_4308966877412207793
└── _started_4308966877412207793
```

**Member:** I think this should not be a problem. Anyway, if we want to do some filtering, it should be done in `extractFilesFromGCS` in `BaseDriver`, for all drivers.

));
} else {
throw new Error(`Unsupported export bucket type: ${
this.config.bucketType
@@ -769,6 +787,9 @@
*
* `fs.s3a.access.key <aws-access-key>`
* `fs.s3a.secret.key <aws-secret-key>`
 * For Google Cloud Storage, you can configure storage credentials and create
 * an external location to access it. See:
 * https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage/storage-credentials
 * https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage/external-locations
*/
private async createExternalTableFromSql(tableFullName: string, sql: string, params: unknown[], columns: ColumnInfo[]) {
let select = sql;